PREFACE


On May 27-31, 1985, a series of symposia was held at The University 
of Western Ontario, London, Canada, to celebrate the 70th birthday of
Professor V. M. Joshi. These symposia were chosen to reflect Professor Joshi’s
research interests as well as areas of expertise in statistical science among 
faculty in the Departments of Statistical and Actuarial Sciences, Economics, 
Epidemiology and Biostatistics, and Philosophy. 

From these symposia, the six volumes which comprise the “Joshi 
Festschrift” have arisen. The 117 articles in this work reflect the broad 
interests and high quality of research of those who attended our conference. 
We would like to thank all of the contributors for their superb cooperation 
in helping us to complete this project. 

Our deepest gratitude must go to the three people who have spent so 
much of their time in the past year typing these volumes: Jackie Bell, Lise 
Constant, and Sandy Tamowski. This work has been printed from “camera 
ready” copy produced by our Vax 785 computer and QMS Lasergraphix 
printers, using the text processing software TEX. At the initiation of this 
project, we were neophytes in the use of this system. Thank you, Jackie, Lise, 
and Sandy, for having the persistence and dedication needed to complete this 
undertaking. 

We would also like to thank Maria Hlawka-Lavdas, our systems analyst, 
for her aid in the layout design of the papers and for resolving the many 
difficult technical problems which were encountered. Nancy Nuzum and Elly 
Pakalnis have also provided much needed aid in the conference arrangements 
and in handling the correspondence for the Festschrift. 

Professor Robert Butts, the Managing Editor of The University of Western
Ontario Series in Philosophy of Science, has provided us with his advice
and encouragement. We are confident that the high calibre of the papers in 
these volumes justifies his faith in our project. 

In a Festschrift of this size, a large number of referees were needed. 
Rather than trying to list all of the individuals involved, we will simply say 
“thank you” to the many people who undertook this very necessary task for 
us. Your contributions are greatly appreciated. 

Financial support for the symposia and Festschrift was provided by The 
University of Western Ontario Foundation, Inc., The University of Western 
Ontario and its Faculties of Arts, Science, and Social Science, The UWO 
Statistical Laboratory, and a conference grant from the Natural Sciences
and Engineering Research Council of Canada. Their support is gratefully
acknowledged. 

Finally, we would like to thank Professor Joshi for allowing us to hold the 
conference and produce this Festschrift in his honor. Professor Joshi is a very 
modest man who has never sought the limelight. However, his substantial 
contributions to statistics merit notice (see Volume I for a bibliography of 
his papers and a very spiffy photo). We hope he will accept this as a tribute 
to a man of the highest integrity. 



INTRODUCTION TO VOLUME V 


The discipline of biostatistics has rapidly expanded over the last fifteen 
years, a period which has seen many new methods developed as well as 
increasing insight gained into older methods. The sixteen papers in this 
volume reflect both of these trends very well. 

The lead paper by Lawless and Singhal discusses and evaluates the use
of regression methodology, including proportional hazards and accelerated
failure time models, to explore large medical data bases. An example is given 
of the application of these methods to a data base containing information 
on ovarian carcinoma patients. 

The paper by Ciampi, Chang, Hogg and McKinney also deals with
exploratory data analysis, discussing the application of Recursive Partition,
a clustering method, to the determination of prognostic strata, propensity
strata and treatment response strata in clinical and epidemiological
studies. They also tackle the general problem of evaluating the statistics of
exploratory analyses. 

The third paper in this volume, by Chapman, Etezadi-Amoli, Selby, 
Boyd and Dailey, reports on the use of linear analogue scales in assessing 
the quality of life of cancer patients. The report is based on the results 
of a study in which the properties of these scales were investigated after 
administration to a large sample of breast cancer patients. 

Krewski, Smythe and Colin discuss the use of historical control
information to help increase the sensitivity of tests for increasing trend in tumor
occurrence with increasing dose. They propose a two-stage procedure in
which the historical control data is used only if the concurrent control
response rate falls within a suitable tolerance interval defined in terms of the
historical control distributions. 

The next group of four papers deals with inferences concerning odds 
ratios in one or more sets of 2 × 2 contingency tables. Walter considers and
evaluates the performance of alternative point estimators of a single odds
ratio and its logarithm in 2 × 2 tables with small cell frequencies, while Wells
and Donner develop formulas for the bias and mean square error of the logit
estimator from a single 2 × 2 table. The two papers by Hauck and by Donald
and Donner deal with multiple 2 × 2 contingency tables, focussing on the
estimation of a common odds ratio. Hauck provides a review of various point 
and interval estimators for the situation, while Donald and Donner discuss 
the effect of clustering on the analyses of multiple contingency tables. 


Mantel and Paul also focus on the analyses of categorical data,
critiquing the use of standard goodness-of-fit tests in toxicological experiments
involving litters of varying age. 

The papers by Bull and Donner and by Koval and Donner deal with 
logistic regression analyses, a methodology which has become extremely
popular among biostatisticians in recent years. Bull and Donner derive the large
sample relative efficiency of multinomial logistic regression, with an arbitrary 
number of response groups, to multiple group discriminant analyses. Koval 
and Donner propose a model to incorporate clustering, as found in family 
data, for example, into logistic estimation. 

Konishi and Gupta discuss inferences concerning interclass and
intraclass correlation from familial data. They review the existing literature and
propose a new procedure for testing the equality of intraclass correlations in 
two multivariate normal populations. 

The next two papers focus on the topic of estimation in bioassay. 
Viveros and Sprott present an alternative to the Fieller method for obtaining 
approximate confidence intervals for LD50 and LD90; their method is based 
on transforming the parameters of interest. Walsh, Hubert and Carter
continue their recent work in multivariate bioassay. They also propose a new
test for “parallelism” for univariate symmetric parabolic bioassays. 

Shoukri and Consul present an overview of the generalized Poisson
distribution, and illustrate the generation of this model with a number of
examples from biology and medicine.

The final paper ends this volume on a high note. Federer and Murty
discuss the application of multivariate analyses for intercropping experiments.
Dr. Federer is one of the world's leading experts in the field of experimental 
design, and we wish to congratulate him on the recent Festschrift for his 
retirement. However, he tells us that he will still be very active in research, 
and we look forward to his further contributions to statistical science. 



J. F. Lawless and K. Singhal 1 


REGRESSION METHODS AND THE EXPLORATION 
OF LARGE MEDICAL DATA BASES 

1. INTRODUCTION 

Many moderate to large size data bases are assembled in connection with 
the study and treatment of chronic diseases. Examples are data bases related 
to the diagnosis and treatment of patients with various forms of cancer (e.g., 
Dembo et al., 1982; Ciampi et al., 1981) or coronary heart disease (e.g., Mock 
et al., 1982), or those related to prospective studies, for example on the risk 
of coronary heart disease (Keys et al., 1972). Such data bases typically 
contain survival or other event-history information about individuals, along 
with information on other pertinent variables. Our objective in this paper 
is to review the use of regression methods in exploring them. 

We consider problems in which the relationship of some primary
variable to other explanatory variables is of interest. Following usual regression
terminology, we will refer to the primary variable as the response variable 
and the others as explanatory variables or covariates. We assume that the 
data base may have information on a fairly large number of individuals 
(somewhere between a hundred and many thousand is common) and that 
there may be a large number of explanatory variables possibly related to 
a given primary variable. Although sometimes all or part of the data may 
have arisen from controlled randomized trials, the data are often
observational. Consequently we focus on rather general statistical objectives related
to exploring and describing the structure of the data. These include 

(i) assessment of which explanatory variables appear to be related to the 
response variable, and description of the nature of the relationship, 

(ii) summarization and description of salient features of the data. 

These are of course very broad objectives, about which we say more in
following sections.


1 Department of Statistics and Actuarial Science, University of Waterloo,
Waterloo, Ontario N2L 3G1 (both authors)

I. B. MacNeill and G. J. Umphrey (eds.), Biostatistics, 1-22. 

© 1987 by D. Reidel Publishing Company. 





To use regression methods it is necessary to adopt models or paradigms 
to guide the analysis. In Section 2 we outline strategies which we have 
found useful when the response variable is a (continuous) survival or response 
time. We mention, in Section 6, situations where the response variable is 
categorical, but to conserve space do not discuss them at length. We do note 
that a large proportion of practical problems can be addressed using either 
continuous or discrete-response regression models. 

Section 3 reviews methods for fitting, screening and criticising
models; both important recent developments and gaps in current knowledge are
noted. We discuss software in Section 4, including a package written by K.
Singhal (1985) which provides flexible methodology for the regression
analysis of lifetime and categorical response data. In Section 5 we illustrate points
made earlier by examining data on the survival of patients with ovarian
carcinoma. Section 6 concludes with a few additional remarks.


2. SURVIVAL TIME REGRESSION MODELS 

We consider regression models for which the response variable Y is a 
survival time and $x = (x_1, \ldots, x_k)'$ is a vector of explanatory variables. The
conditional survivor function (SF), probability density function (PDF), and
hazard function (HF) of Y given x are denoted respectively by

$$S(y \mid x) = \Pr(Y > y \mid x),$$
$$f(y \mid x) = -dS(y \mid x)/dy,$$

and

$$h(y \mid x) = f(y \mid x)/S(y \mid x).$$

It is easily verified that $S(y \mid x) = \exp\left(-\int_0^y h(u \mid x)\,du\right)$, which we note for
future reference.
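
The identity linking the SF and the HF can be checked numerically. The sketch below is illustrative only: it assumes a Weibull-type hazard of the form introduced in Section 2.2, with arbitrary parameter values, and the function names are hypothetical.

```python
import math

def weibull_hazard(u, p, theta):
    """Hazard h(u) = p * u**(p-1) * theta (assumed form, for illustration)."""
    return p * u ** (p - 1) * theta

def survivor_from_hazard(hazard, y, steps=20000):
    """Approximate exp(-integral_0^y hazard(u) du) with the midpoint rule."""
    width = y / steps
    cumulative = sum(hazard((j + 0.5) * width) for j in range(steps)) * width
    return math.exp(-cumulative)

p, theta, y = 1.5, 0.3, 2.0   # illustrative values, not from the paper
numeric = survivor_from_hazard(lambda u: weibull_hazard(u, p, theta), y)
closed_form = math.exp(-theta * y ** p)   # SF obtained analytically
print(numeric, closed_form)               # the two values should agree closely
```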

In the regression analysis of survival time data two main paradigms are 
used often: these are the so-called proportional hazards (PH) and accelerated 
failure time (AFT) families of models. We next outline these and comment 
on their features. 

2.1 Proportional Hazards and Accelerated Failure Time Models 

Proportional hazards models assume that x affects the distribution of Y 
by having a multiplicative effect on the HF. In other words, $h(y \mid x)$ can be
expressed as $h_0(y)g(x)$, where $g(x)$ is a positive-valued function and $h_0(y)$ is
a baseline HF corresponding to an x for which $g(x) = 1$. Equivalently, the
SF of Y given x is expressible as

$$S(y \mid x) = S_0(y)^{g(x)}, \tag{2.1}$$

where $S_0(y) = \exp\left(-\int_0^y h_0(s)\,ds\right)$ is the baseline SF for an individual with
$g(x) = 1$. In general one or both of $h_0(y)$ and $g(x)$ may involve unknown
parameters, and there are two main approaches to statistical analysis with
PH models. One due to Cox (1972) is semi-parametric: it does not assume
any particular form for $h_0(y)$ or $S_0(y)$. It yields parametric inference
procedures for parameters in $g(x)$ and nonparametric estimates of $h_0(y)$ or $S_0(y)$.
Kalbfleisch and Prentice (1980, ch. 4), Lawless (1982, ch. 7) and Cox and
Oakes (1984) give detailed discussions. The other approach is to assume a
parametric form for both $g(x)$ and $h_0(y)$ and to use standard parametric
inference methods.

In contrast with PH models, AFT models suppose that x affects the 
distribution of Y by acting multiplicatively on the time scale. That is, 
S(y | x) can be expressed in the form 

$$S(y \mid x) = S_0(g(x)y), \tag{2.2}$$

where, as before, $g(x)$ is a positive-valued function and $S_0(y)$ is a baseline
SF. Accelerated failure time models are location-scale models on the log time
scale: if $W = \log Y$ then the distribution of W given x can be expressed as

$$W = -\log g(x) + \varepsilon, \tag{2.3}$$

where $\varepsilon$ is a random variable whose distribution does not depend on x. The
common approach to AFT analysis is to use a parametric model where $S_0(y)$
or in other words $\varepsilon$ in (2.3) is taken to have a given form. The lognormal and
Weibull models, wherein $\varepsilon$ has a normal and an extreme value distribution,
respectively, are two often-used models. Semi-parametric analyses which do
not make parametric assumptions about $S_0(y)$ or $\varepsilon$ are also possible (e.g.,
Kalbfleisch and Prentice 1980, ch. 6; Louis, 1981) but are computationally 
forbidding and not widely used at present. 

2.2 Some Parametric Models 

An approach we find useful is to use a fairly flexible class of parametric 
AFT models along with the Cox semi-parametric PH methods. In so doing 
we tend to work with the convenient and widely used form $g(x) = \exp(x'\beta)$ in
(2.1) and (2.2), where $\beta = (\beta_1, \ldots, \beta_k)'$ is a vector of regression coefficients.
Methods discussed below also apply to other parametric forms for $g(x)$ but
for convenience we shall restrict discussion to this one.
Several families of AFT models possess similar properties and tend to
fit survival data, and which ones to work with is, in part, a matter of taste.
We prefer Weibull and log logistic models, and a Burr model that includes
both of them as special cases, but we also use other models on occasion. The
Weibull model (e.g., Lawless, 1982, ch. 6) can be expressed as

$$S(y \mid x) = \exp[-\theta(x) y^p], \quad y > 0, \tag{2.4}$$

where $p > 0$ and $\theta(x) = \exp(\alpha + x'\beta)$. The corresponding HF is
$h(y \mid x) = p y^{p-1}\theta(x)$, and we observe that (2.4) is both an AFT and a PH model. It is
in fact the unique model with this property. Note also that the distribution
of $W = \log Y$ can be expressed as

$$W = -\alpha\sigma - x'\beta\sigma + \sigma\varepsilon, \quad -\infty < \varepsilon < \infty,$$

where $\sigma = p^{-1}$ and $\varepsilon$ has a standard extreme value distribution with PDF
$\exp(\varepsilon - e^{\varepsilon})$.
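
The dual PH/AFT character of the Weibull model can be seen numerically. The following is a minimal sketch, assuming a single scalar covariate and arbitrary parameter values (the function names are hypothetical): the hazard ratio between two covariate values does not depend on y, and the ratio of quantiles does not depend on the quantile level.

```python
import math

def theta(x, alpha, beta):
    """theta(x) = exp(alpha + beta * x) for one scalar covariate x."""
    return math.exp(alpha + beta * x)

def hazard(y, x, p, alpha, beta):
    """Weibull hazard p * y**(p-1) * theta(x), as in (2.4)."""
    return p * y ** (p - 1) * theta(x, alpha, beta)

def quantile(q, x, p, alpha, beta):
    """Solve S(y | x) = 1 - q for y, with S(y | x) = exp(-theta(x) * y**p)."""
    return (-math.log(1.0 - q) / theta(x, alpha, beta)) ** (1.0 / p)

p, alpha, beta = 1.4, -2.0, 0.7   # illustrative values only

hazard_ratios = [hazard(y, 1, p, alpha, beta) / hazard(y, 0, p, alpha, beta)
                 for y in (0.5, 1.0, 5.0)]
quantile_ratios = [quantile(q, 1, p, alpha, beta) / quantile(q, 0, p, alpha, beta)
                   for q in (0.1, 0.5, 0.9)]
print(hazard_ratios)     # PH property: every entry close to exp(beta)
print(quantile_ratios)   # AFT property: every entry close to exp(-beta / p)
```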

The log logistic AFT model (e.g., Bennett, 1983a) has SF 

$$S(y \mid x) = [1 + \theta(x) y^p]^{-1}, \quad y > 0, \tag{2.5}$$

where $p > 0$ and $\theta(x) = \exp(\alpha + x'\beta)$. It is also expressible as

$$W = -\alpha\sigma - x'\beta\sigma + \sigma\varepsilon, \quad -\infty < \varepsilon < \infty,$$

where $W = \log Y$, $\sigma = p^{-1}$ and $\varepsilon$ has a standard logistic distribution with
PDF $e^{\varepsilon}/(1 + e^{\varepsilon})^2$. The HF for this model is

$$h(y \mid x) = \frac{p y^{p-1}\theta(x)}{1 + \theta(x) y^p}$$

and is nonmonotonic in y for $p > 1$: it is easily seen that $h(y \mid x)$ increases
up to $y = [(p - 1)/\theta(x)]^{1/p}$ then decreases thereafter. This value of y is the
$(p - 1)/p$th quantile, independent of x; for medical survival data p is often
between one and two, so the decrease starts to the left of the median. For
$p < 1$, $h(y \mid x)$ is decreasing for all $y > 0$.
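
The turning point of the log logistic hazard can be confirmed with a crude grid search. This is a sketch under assumed parameter values (p = 1.6 and a fixed value of theta standing in for theta(x)); the helper names are hypothetical.

```python
import math

def loglogistic_hazard(y, p, theta):
    """HF of the log logistic model (2.5) for a fixed value theta of theta(x)."""
    return p * y ** (p - 1) * theta / (1.0 + theta * y ** p)

def loglogistic_cdf(y, p, theta):
    """1 - S(y | x) for the log logistic model (2.5)."""
    return 1.0 - 1.0 / (1.0 + theta * y ** p)

p, theta = 1.6, 0.25                        # illustrative values with p > 1
mode = ((p - 1.0) / theta) ** (1.0 / p)     # turning point stated in the text

grid = [0.01 * j for j in range(1, 2001)]   # y from 0.01 to 20.00
grid_argmax = max(grid, key=lambda y: loglogistic_hazard(y, p, theta))

print(mode, grid_argmax)                               # agree to grid resolution
print(loglogistic_cdf(mode, p, theta), (p - 1.0) / p)  # quantile level (p - 1)/p
```

With p between one and two the quantile level (p - 1)/p is below one half, which is the sense in which the decline in the hazard begins to the left of the median.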

A third distribution we find useful is the Burr XII model (e.g.,
Tadikamalla, 1980) with SF

$$S(y \mid x) = [1 + \lambda y^p \theta(x)]^{-1/\lambda}, \quad y > 0, \tag{2.6}$$

where $p > 0$, $\lambda > 0$ and $\theta(x) = \exp(\alpha + x'\beta)$. This family includes the
log logistic ($\lambda = 1$) and Weibull (limit as $\lambda \to 0$) as special cases. The
additional parameter $\lambda$ provides extra flexibility and the distribution can
also be used to discriminate parametrically between the Weibull and log
logistic distributions.
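
The two special cases of the Burr family can be illustrated directly from the survivor functions. The sketch below assumes arbitrary values of p and theta(x) and approximates the Weibull limit by a very small positive value of the extra shape parameter; it is not taken from any software described in this paper.

```python
import math

def burr_sf(y, p, theta, lam):
    """Burr XII survivor function (2.6) for a fixed value theta of theta(x)."""
    return (1.0 + lam * theta * y ** p) ** (-1.0 / lam)

def loglogistic_sf(y, p, theta):
    """Log logistic survivor function (2.5)."""
    return 1.0 / (1.0 + theta * y ** p)

def weibull_sf(y, p, theta):
    """Weibull survivor function (2.4)."""
    return math.exp(-theta * y ** p)

p, theta = 1.3, 0.4   # illustrative values only
rows = []
for y in (0.5, 1.0, 3.0):
    rows.append((burr_sf(y, p, theta, 1.0), loglogistic_sf(y, p, theta),
                 burr_sf(y, p, theta, 1e-6), weibull_sf(y, p, theta)))
    print(y, rows[-1])   # columns 1-2 coincide; columns 3-4 nearly coincide
```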

2.3 Discussion 

The PH and AFT paradigms are flexible and convenient for survival 
data analysis, and in many situations allow a succinct representation of the 
approximate effect of explanatory variables on the distribution of survival 
time. In one (AFT) the ratio of the pth quantiles of survival time for two 
individuals with arbitrary covariate vectors $x_1$ and $x_2$ is constant for all
p (0 < p < 1), and in the other (PH) the ratio of the hazard functions
$h(y \mid x_1)$ and $h(y \mid x_2)$ is constant for all y > 0. The PH model has been
found realistic and useful in many areas, for example in clinical trial work and 
in epidemiological investigations of disease incidence and mortality. For the 
kinds of survival times associated with many chronic disease data bases, we 
often find the PH model less compelling. When the population is
heterogeneous and there is considerable variation in survival times, the assumption
of constant proportional hazards at all times y often seems unreasonable,
and AFT or modified PH-type models often fit the data better. Gore et
al. (1984) have given an excellent discussion of this and related points in
connection with breast cancer survival, and we discuss in Section 5 some 
experiences with ovarian cancer survival. 

We usually fit both PH and AFT models in a given situation, as a first 
step. With the PH family we use the semi-parametric Cox analysis partly 
because of its flexibility in allowing HF’s of different shapes; we deem this
important because in large heterogeneous data sets there is often evidence 
of non-monotonic HF’s. With the AFT family we tend to fit members of 
the Burr distribution (2.6), including the Weibull and log logistic models. 
The Weibull is of course also a PH model with, however, monotonic hazard 
functions. The Burr family is flexible and easily handled statistically and 
computationally. For $\lambda > 0$ it allows a wide range of hazard functions that
increase to a point and then decrease, and for the Weibull case $\lambda = 0$ it
gives monotonic hazard functions. For the PH model the flexibility and
simplicity of the Cox analysis make it less urgent to develop more flexible fully
parametric models, and we have not systematically explored this direction. 

The PH and AFT models make strong assumptions about the effect 
of x and of course there are situations where neither model is reasonable.
One type of departure from each model deserves special scrutiny. For AFT 
models it is assumed that the shape of the distribution of log Y does not 
change with x, only the location. In particular, the dispersion is a constant 
and one should thus be on the lookout for heteroscedasticity in log Y (e.g., 
Cook and Weisberg, 1983). For the PH model the hazard function ratios are 
constant for any two x’s, and one should watch for departures from this (e.g.,
Andersen, 1982). When departures of the kinds noted are serious, extensions 
to the PH or AFT models frequently provide a satisfactory representation 
of the data. 

Other families of models can sometimes be used to good effect. For 
example, Bennett (1983b) has discussed the proportional odds (PO) family, 
in which $\log\{S(y \mid x)/[1 - S(y \mid x)]\}$ is taken to be a linear function of
$x_1, \ldots, x_k$. (Another nice feature of the parametric Burr family is that the
special-case log logistic is the unique distribution which is both a PO and
an AFT model.) Usually we find, however, that the PH and AFT models
between them provide sufficient flexibility to deal with most situations,
provided care is taken to accommodate extensions such as the possible
dependence of p in (2.4), (2.5) or (2.6) on x. At the same time, we have not found
it particularly useful to adopt more refined parametric distributions within 
either the PH or AFT models. For example, the log F (e.g., Kalbfleisch and 
Prentice, 1980, ch. 3) is sometimes put forward as an “error” distribution 
within the AFT model (see the form (2.3)). This has a second shape
parameter and includes the Burr family as a special case. We have found that, in
view of the uncertainty surrounding an appropriate specification for $g(x)$ and
the assumption of a constant p parameter, additional parametric flexibility
in the baseline distribution beyond that provided in the Burr family is
usually not essential. Indeed, with substantial censoring in the right-hand tail
of the distribution it becomes difficult to discriminate well among baseline
distributions anyway. In addition, the picture that emerges concerning the
relative importance of explanatory variables is often fairly insensitive to the
baseline distribution used. Of course, in situations with only a small number
of explanatory variables, a more detailed look at the baseline distributions 
involved is feasible. 

We next discuss some specific statistical considerations in the use of the 
regression models described here. 

3. SOME STATISTICAL CONSIDERATIONS 

To facilitate modelling and the exploration of data we need the means 
to manipulate and portray the data and fitted models (graphically and
otherwise), to fit models, and to assess their adequacy. We discuss aspects of a
few important procedures here, and consider associated software in Section 
4. 

For convenient preliminary examination of a large data base, good
graphics and data manipulation and summarization methods are desirable. This
is not discussed here except in the context of an example later. We turn
immediately to model fitting and related matters.

3.1 Model Fitting and Screening 

Relatively little needs to be said about the estimation of parameters, 
maximum likelihood estimation for models discussed in Section 2 being 
straightforward. The survival times in most data bases are subject to right 
censoring, so we consider the data on an individual to be $(y_i, \delta_i, x_i)$, where
$y_i$ is either an observed survival time or a censoring time, $\delta_i$ is an indicator
which takes the value 1 if $y_i$ is a survival time and 0 if it is a censoring
time, and $x_i$ is the covariate vector. Under quite general conditions (e.g.,
Kalbfleisch and Prentice, 1980, ch. 5) the likelihood function from data on
n independent individuals is then

$$L = \prod_{i=1}^{n} f(y_i \mid x_i)^{\delta_i} S(y_i \mid x_i)^{1-\delta_i}. \tag{3.1}$$

Maximum likelihood estimates (MLE’s) for the parametric models of
Section 2 are readily obtained by maximizing (3.1), and interval estimates and
significance tests may be based on the usual large sample methods. Lawless
(1982, ch. 6), Kalbfleisch and Prentice (1980, ch. 3) and Bennett (1983a)
provide details for the Weibull, log logistic, Burr, and other models. For
the Cox PH analysis a partial likelihood for $\beta$ in $g(x) = \exp(x'\beta)$ of (2.1) is
generally used for inference; see for example Kalbfleisch and Prentice (1980, 
ch. 4) for details. 
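
As an illustration of how the censored-data likelihood (3.1) is evaluated, the sketch below computes its logarithm for the Weibull model (2.4). The toy records and the function name are hypothetical and serve only to show the calculation; the check at the end uses the fact that, for the exponential special case (p = 1, no covariate effect), the MLE of theta is the number of observed failures divided by the total time observed.

```python
import math

def weibull_loglik(p, alpha, beta, data):
    """Log of (3.1) for the Weibull model (2.4).

    data is a list of (y, delta, x) records, x being a list of covariates.
    """
    total = 0.0
    for y, delta, x in data:
        log_theta = alpha + sum(b * v for b, v in zip(beta, x))
        log_sf = -math.exp(log_theta) * y ** p                      # log S(y | x)
        log_hazard = math.log(p) + (p - 1.0) * math.log(y) + log_theta
        # log f = log h + log S, so
        # delta * log f + (1 - delta) * log S = delta * log h + log S.
        total += delta * log_hazard + log_sf
    return total

# Hypothetical toy records (y, delta, x); not data from the paper.
data = [(2.1, 1, [0.0]), (3.4, 0, [1.0]), (0.7, 1, [1.0]),
        (5.0, 0, [0.0]), (1.6, 1, [0.0])]

events = sum(delta for _, delta, _ in data)
exposure = sum(y for y, _, _ in data)
alpha_hat = math.log(events / exposure)   # exponential MLE on the log scale
best = weibull_loglik(1.0, alpha_hat, [0.0], data)
print(alpha_hat, best)
```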

Since there are often many potential explanatory variables present in the 
data, “all-subsets” regression procedures are useful for model screening
purposes. In particular, with k explanatory variables there are $2^k$ regression
models that can be obtained by including any subset of $x_1, \ldots, x_k$ in the
model and excluding the rest. Stepwise regression techniques are often used
to determine a “good” model, but much more information about the
relationship between explanatory and response variables is provided by determining
all of the better-fitting models and not just the few discovered by stepwise
fitting. For normal linear models all-subsets methods are well known and
implemented, for example, in SAS (PROC RSQUARE) and BMDP (9R).
Lawless and Singhal (1978, 1985) discuss all-subsets procedures for many
generalized linear models, including those of Section 2. These operate as follows:
the models in Section 2 all have regression coefficient parameters $\beta_1, \ldots, \beta_k$
plus some additional parameters $\psi$. The $2^k$ “subset” models are obtained
by letting different subsets of $\beta_1, \ldots, \beta_k$ have value 0. The parameters in
$\psi$ are assumed to be always included in the model, the constant term $\alpha$ in
such models as (2.4), (2.5) and (2.6) being one of these. For the Weibull
model (2.4), for example, $\psi = (\alpha, p)'$. Good subset models are defined in
terms of the likelihood ratio statistic for testing the subset model against
the full model. In particular, suppose $\beta = (\beta_1', \beta_2')'$ and consider the subset
model corresponding to $\beta_2 = 0$. The likelihood ratio statistic for testing
$H_0: \beta_2 = 0$; $\beta_1, \psi$ unrestricted vs. $H_1: \beta_1, \beta_2, \psi$ unrestricted is

$$R = 2 \log L(\hat\beta_1, \hat\beta_2, \hat\psi) - 2 \log L(\tilde\beta_1, 0, \tilde\psi), \tag{3.2}$$

where $(\hat\beta_1, \hat\beta_2, \hat\psi)$ is the unrestricted MLE of $(\beta_1, \beta_2, \psi)$ and $(\tilde\beta_1, 0, \tilde\psi)$ is the
MLE with $\beta_2$ constrained to be 0. Large values of R provide evidence against
the subset model, and under $H_0$, R is distributed approximately as $\chi^2_{(k-p)}$
in large samples, where $p = \dim(\beta_1)$. Our approach is thus to determine
those subset models for which R is small. To compare models with different
numbers of covariates we may use the Akaike information criterion (AIC),
defined as

$$\mathrm{AIC} = -2 \log L(\tilde\beta_1, 0, \tilde\psi) + 2(p + m), \tag{3.3}$$

where $m = \dim(\psi)$. This is analogous to the use of Mallows’ $C_p$ statistic
with normal linear models. Software for carrying out all-subset searches is 
described in Section 4. 
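
The all-subsets idea can be sketched by brute force for a small problem. The example below is an assumption-laden illustration, not the efficient algorithm of Lawless and Singhal: it uses the exponential regression model (the Weibull model (2.4) with p fixed at 1, so that m = 1), fits every subset by Newton-Raphson with step halving, and reports R of (3.2) and the AIC of (3.3). The data are synthetic and all function names are hypothetical.

```python
import itertools
import math
import random

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def loglik(gamma, rows):
    """Censored exponential log-likelihood; each row is (y, delta, z)."""
    total = 0.0
    for y, delta, z in rows:
        eta = sum(g * v for g, v in zip(gamma, z))
        total += delta * eta - y * math.exp(eta)
    return total

def max_loglik(rows, iterations=50):
    """Maximized log-likelihood via Newton-Raphson with step halving."""
    dim = len(rows[0][2])
    events = sum(d for _, d, _ in rows)
    exposure = sum(y for y, _, _ in rows)
    gamma = [math.log(events / exposure)] + [0.0] * (dim - 1)
    current = loglik(gamma, rows)
    for _ in range(iterations):
        grad = [0.0] * dim
        info = [[0.0] * dim for _ in range(dim)]
        for y, delta, z in rows:
            mu = y * math.exp(sum(g * v for g, v in zip(gamma, z)))
            for a in range(dim):
                grad[a] += (delta - mu) * z[a]
                for b in range(dim):
                    info[a][b] += mu * z[a] * z[b]
        step = solve(info, grad)
        scale = 1.0
        while scale > 1e-8:
            trial = [g + scale * s for g, s in zip(gamma, step)]
            value = loglik(trial, rows)
            if value >= current:
                break
            scale /= 2.0
        else:
            return current              # no uphill step found; stop here
        gain = value - current
        gamma, current = trial, value
        if gain < 1e-10:
            break
    return current

def all_subsets(data, k):
    """Map each subset of covariate indices to (R, AIC)."""
    def fit(subset):
        rows = [(y, d, [1.0] + [x[j] for j in subset]) for y, d, x in data]
        return max_loglik(rows)
    full = fit(tuple(range(k)))
    results = {}
    for size in range(k + 1):
        for subset in itertools.combinations(range(k), size):
            value = fit(subset)
            r_stat = 2.0 * full - 2.0 * value       # statistic (3.2)
            aic = -2.0 * value + 2.0 * (size + 1)   # criterion (3.3), m = 1
            results[subset] = (r_stat, aic)
    return results

# Synthetic data for illustration only: three binary covariates, of which
# only the first affects survival; times are censored at 3.0.
random.seed(0)
data = []
for _ in range(200):
    x = [random.randint(0, 1) for _ in range(3)]
    t = random.expovariate(math.exp(-1.0 + 0.8 * x[0]))
    data.append((min(t, 3.0), 1 if t <= 3.0 else 0, x))

results = all_subsets(data, 3)
for subset, (r_stat, aic) in sorted(results.items(), key=lambda item: item[1][1]):
    print(subset, round(r_stat, 2), round(aic, 2))   # models in order of AIC
```

Refitting every one of the $2^k$ models in this way is feasible only for small k; it is shown here because it makes the definitions of R and AIC concrete.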

3.2 Methods for Model/Data Criticism 

It is essential that we be able to assess the adequacy of fitted models, 
and it is also important to identify unusual or otherwise interesting features 
of the data. There is a large recent literature on model/data criticism (e.g., 
see Cook and Weisberg, 1982), but for the models of Section 2 in conjunction 
with censored data, relatively little has been done. We will outline briefly our 
current approach, which is simply to determine fitted residuals and leverage 
values which can be examined in various ways. 

We use the Cox-Snell (1968) type of residuals, which are discussed, for 
example, by Kalbfleisch and Prentice (1980, ch. 4), Kay (1977), Lawless 
(1982, chs. 6, 7). For the Burr models (2.6) these can take the form 

$$r_i = y_i^{\hat p}\,\hat\theta(x_i) = y_i^{\hat p} \exp(\hat\alpha + x_i'\hat\beta), \tag{3.4}$$

where $\hat p, \hat\alpha, \hat\beta$ are MLE’s. Since $y_i$ may be either a survival or a censoring
time, if (2.6) is appropriate the $r_i$’s should look roughly like a censored
sample from the corresponding standard distribution having $p = 1$, $\theta(x) = 1$,
and the value of $\lambda$ in question. In particular, for the Weibull ($\lambda = 0$) and log
logistic ($\lambda = 1$) models the $r_i$’s should look roughly like censored standard
Weibull and log logistic samples, respectively. (It should perhaps be said
that our approach is to fit members of the Burr family with fixed values
of $\lambda$. Although we often determine the value of $\lambda$ which maximizes the
likelihood, we compute asymptotic variance estimates, confidence intervals
and the like for the other parameters with $\lambda$ treated as fixed.)

For the Cox analysis of the PH model we define residuals as in Kay 
(1977), 

$$r_i = [-\log \hat S_0(y_i)]\, e^{x_i'\hat\beta}, \tag{3.5}$$

where $\hat\beta$ is the partial likelihood MLE of $\beta$ and $\hat S_0(y)$ is the Nelson-Altschuler
nonparametric estimator of $S_0(y)$ given by (7.2.34) of Lawless (1982). If the
PH model is appropriate these $r_i$’s should look roughly like a censored sample
from the standard exponential (= standard Weibull) distribution.

For plots of residuals against covariates or other factors we adjust
residuals $r_i$ corresponding to censoring times upward to the
median of $\varepsilon_i$, given $\varepsilon_i > r_i$, where $\varepsilon_i$ has the standard distribution used
to assess the $r_i$’s. For example, for the Weibull model (2.4) and
residuals (3.4), $\varepsilon_i$ has a standard exponential distribution with survivor function
$S(\varepsilon) = \exp(-\varepsilon)$. It is easily shown that

$$\mathrm{median}(\varepsilon_i \mid \varepsilon_i > r_i) = r_i + \log 2,$$

so we define adjusted residuals to be $r_i + \log 2$ when $y_i$ is a censoring time.
In plots, residuals corresponding to uncensored and censored observations 
should still, however, be labelled differently. 
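
The adjustment for the Weibull case is a one-line calculation. In the sketch below (hypothetical residual values and function names), the second function checks the underlying fact: for a standard exponential variable, the conditional probability of exceeding r + log 2 given that r has been exceeded is one half.

```python
import math

def adjusted_residuals(residuals, deltas):
    """Add log 2 to residuals that come from censoring times (delta == 0)."""
    return [r if d == 1 else r + math.log(2.0) for r, d in zip(residuals, deltas)]

def conditional_exceedance(r):
    """Pr(eps > r + log 2 | eps > r) for a standard exponential eps."""
    return math.exp(-(r + math.log(2.0))) / math.exp(-r)

residuals = [0.3, 1.2, 0.8]   # hypothetical residuals, for illustration
deltas = [1, 0, 1]            # the second residual is from a censoring time
adjusted = adjusted_residuals(residuals, deltas)
print(adjusted)
print(conditional_exceedance(1.2))   # one half, up to rounding
```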

Probability plots of residuals from the fully parametric models can also 
be useful. Since the $r_i$’s are censored we form a product-limit survivor
function estimate from the (unadjusted) $r_i$’s, assumed here to be distinct, as

$$\hat S(r) = \prod_{i:\, r_i \le r,\ \delta_i = 1} \frac{n_i - 1}{n_i}, \tag{3.6}$$

where $n_i$ is the number of residuals $\ge r_i$. Then $\hat S(r)$ can be compared
with what would be expected from the distribution of the corresponding
$\varepsilon$. For the Weibull model $S(\varepsilon) = \exp(-\varepsilon)$ so a plot of $-\log \hat S(r_i)$ vs. $r_i$
should be approximately linear with slope one. An alternative is to plot
$\log[-\log \hat S(r_i)]$ vs. $\log r_i$, which reveals more detail near $r = 0$. For
the log logistic model the corresponding plot is of $\log[(1 - \hat S(r_i))/\hat S(r_i)]$ vs.
$\log r_i$.
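
A minimal sketch of the probability-plot calculation follows. It assumes the usual product-limit estimate for distinct residuals, in which each uncensored residual multiplies the running estimate by (n_i - 1)/n_i with n_i the number of residuals at least as large; the residual values and function names are hypothetical, and only the plotting coordinates are produced.

```python
import math

def product_limit(residuals, deltas):
    """Return [(r_i, S_hat(r_i))] at the uncensored residuals, in increasing order."""
    order = sorted(range(len(residuals)), key=lambda i: residuals[i])
    n = len(residuals)
    surv, points = 1.0, []
    for rank, i in enumerate(order):
        at_risk = n - rank            # residuals >= r_i, distinct values assumed
        if deltas[i] == 1:
            surv *= (at_risk - 1) / at_risk
            points.append((residuals[i], surv))
    return points

def weibull_plot_coordinates(points):
    """(log r, log(-log S_hat)) pairs; points with S_hat equal to 0 or 1 are skipped."""
    return [(math.log(r), math.log(-math.log(s)))
            for r, s in points if 0.0 < s < 1.0]

residuals = [0.2, 0.5, 0.9, 1.4]   # hypothetical residuals
deltas = [1, 0, 1, 1]              # the second residual is censored
points = product_limit(residuals, deltas)
coords = weibull_plot_coordinates(points)
print(points)   # [(0.2, 0.75), (0.9, 0.375), (1.4, 0.0)]
print(coords)   # roughly linear with slope one if the Weibull model fits
```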

On a cautionary note, it should be said that it is not clear how good 
residual probability plots are at detecting model inadequacy. It is recognized 
that for the Cox PH analysis, probability plots of the $r_i$’s defined in (3.5)
may not indicate model inadequacy at all in some instances (e.g., Crowley 
and Storer, 1983). Residual plots for the fully parametric models appear 
more useful, but systematic study of their properties is needed. 
It is also of interest to identify observations which exert a strong influence
on parameter estimates, and observations which are unusual in their y and/or 
x values. A standard way of assessing the influence of an observation is to 
consider the effect on estimation of dropping it (e.g., Cook and Weisberg, 
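1982). If a model has parameter $\theta$ and likelihood function $L(\theta)$, with
$\hat\theta$ the MLE obtained from all the data, let

$$\hat\theta_{(i)} = \text{MLE obtained by using all the data except the } i\text{th observation}$$

and

$$LD_i = 2 \log L(\hat\theta) - 2 \log L(\hat\theta_{(i)}).$$

Several authors have considered approximations to $\hat\theta_{(i)}$ and $LD_i$ that avoid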
intensive computation needed to compute the exact quantities for each of n 
observations. No thorough comparison of approaches has been made for the 
models of Section 2, but see, for example, Hall et al. (1982) for discussion
pertaining to the parametric models and Cain and Lange (1984), Moolgavkar
et al. (1984), and Storer and Crowley (1985) for the Cox PH analysis. The
various approximate influence measures are in a rough sense functions of
some kind of generalized residual that measures agreement between $y_i$ and
the fitted model, and some measure of the distance from $x_i$ to the centroid of
$x_1, \ldots, x_n$. We are investigating the use of various approximations but given
our present state of knowledge and the relatively large computational effort
that large data bases and some of the methods entail, our current approach is
simply to examine the residuals $r_i$ described above, along with the leverage
values $h_{ii} = x_i'(X'X)^{-1}x_i$ used with the normal linear model. The $h_{ii}$’s
help locate observations that are unusual in their $x_i$ values, whereas the $r_i$’s
help identify interesting features in the fit of $Y_i$ on $x_i$. In addition, the most
influential observations tend to be those with large $r_i$ and/or $h_{ii}$ values.
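
The exact case-deletion quantities are easy to compute when the MLE has a closed form. The sketch below assumes the simplest such case, a censored exponential model with no covariates, where the MLE of the rate is the number of failures divided by the total time observed; the records and function names are hypothetical. It computes each deleted-case estimate and the corresponding likelihood displacement exactly.

```python
import math

def exp_loglik(theta, data):
    """Censored exponential log-likelihood for records (y, delta)."""
    events = sum(d for _, d in data)
    exposure = sum(y for y, _ in data)
    return events * math.log(theta) - theta * exposure

def exp_mle(data):
    """Closed-form MLE of the exponential rate: failures / total time."""
    return sum(d for _, d in data) / sum(y for y, _ in data)

def likelihood_displacements(data):
    """Exact LD_i: refit without observation i, evaluate on the full data."""
    full = exp_loglik(exp_mle(data), data)
    result = []
    for i in range(len(data)):
        reduced = data[:i] + data[i + 1:]
        theta_i = exp_mle(reduced)
        result.append(2.0 * full - 2.0 * exp_loglik(theta_i, data))
    return result

# Hypothetical records (y, delta); the last one is an unusually long survival time.
data = [(2.1, 1), (3.4, 0), (0.7, 1), (5.0, 0), (1.6, 1), (12.0, 1)]
ld = likelihood_displacements(data)
print([round(v, 3) for v in ld])   # the last observation has the largest LD_i
```

For the regression models of Section 2 no closed form is available, so each deletion would require a full refit; this is the computational burden that the approximations cited above are designed to avoid.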


4. SOFTWARE 

For data manipulation and portrayal, major packages such as SAS and 
BMDP provide good software. Good and widely accessible software for 
fitting and exploration of common survival analysis models was for a long 
time unavailable, but the situation is now improving. Both SAS and BMDP 
have Cox PH analysis programs, and the latest version of SAS has a program 
that handles censored data from the Weibull, log logistic, log normal and 
other regression models. Some other packages also provide a fairly wide 
range of model fitting capabilities, but are not as accessible or as viable with 
large data sets. Wagner and Meeker (1985) provide a very good survey of 
software in this area, and it appears that even better systems will be available 
soon. 

To provide a comprehensive package that handles both the important 
continuous response regression models and discrete-response models used in 
survival analysis, we have designed, and K. Singhal has written, a system 
called ISMOD. A more complete description of ISMOD is given by Lawless 
and Singhal (1985) and by Singhal (1985). For our purposes here, we note 
that the package can fit most of the common survival time regression models, 
including the exponential, Weibull, log logistic, Burr, log normal and Cox PH 
models. In fitting a single model its output includes parameter estimates, 
asymptotic covariance matrices, and residuals. It handles censored data 
and at present can accommodate up to 20 explanatory variables and large 
numbers of observations. It also has an all-subset feature and will find, 
according to likelihood ratio or approximate likelihood ratio criteria, the 
best-fitting models. In particular, it will find the “best m” models of each of 
sizes 1,2,..., k for m < 10, or the best m models overall in terms of the AIC 
criterion (3.3). This is done using efficient algorithms that keep computation 
to a minimum. 

ISMOD contains facilities such as data transformation and data selection 
capabilities, so that one can transform variables or fit models to parts of the 
data. It also has the ability to generate two-way tables of summary statistics 
for pairs of variables: this is useful both in preliminary data exploration 
and in summarizing the data. A decision was taken not to include graphics 
capabilities or extensive diagnostic facilities inside ISMOD at present, because 
of our feeling that this area still required more development. However, 
ISMOD produces listing files of residuals and other quantities which can be 
plotted and manipulated using other software. We give an example in the 
next section. 


5. AN EXAMPLE: OVARIAN CARCINOMA 

We consider a data base that contains information on 704 ovarian carcinoma 
patients diagnosed during the period 1971-79 and registered at the 
Princess Margaret Hospital in Toronto. The response variable discussed is 
survival time, measured from diagnosis. Of the 704 patients, 341 had died 
by the time of the current data file update and hence there are 363 censored 
observations. After discussions with medical people and preliminary 
examination of the data it was decided to consider the explanatory variables 
$x_1, \ldots, x_{12}$ shown in Table 1. Except for age, all explanatory variables are 
categorical and we have used dummy covariates to represent them. The 
“histology” and “grade” covariates refer to pathological characteristics of 
the carcinoma: the five histology categories are serous or mixed (SER or 
MIX), endometrioid (END), mucinous (MUC), clear cell (CLR) and unclassified 
(UNCL). Grade categories are moderately or well differentiated (MD 
or WD) and poorly differentiated (PD). The “stage” covariates refer to FIGO 
staging and are related to the extent of the disease; prognosis tends to be 
poorer the higher the stage number. “Surgery” refers to whether a complete 
or incomplete Bilateral Salpingo-Oophorectomy Hysterectomy (BSOH) was 
performed on the patient, and “residuum” refers to the amount of diseased 
tissue remaining after surgery. Clinical background on these data and on 
ovarian cancer in general can be found in Dembo et al. (1982), Dembo 
(1984) and other articles cited in these papers. 


Table 1. Ovarian Cancer: Initial Variables 

Variable    Description                            # Cases     # Uncensored 
--------    -----------------------------------    --------    ------------ 
$Y$         Time (days) from diagnosis to death    704         341 
$\delta$    1 - time uncensored, 0 - censored      704         341 
$x_1$       Age at diagnosis (years): mean 54      704         341 
$x_2$       Histology 1: 1-SER or MIX, 0-Other     345         176 
$x_3$       Histology 2: 1-END, 0-Other            143         57 
$x_4$       Histology 3: 1-MUC, 0-Other            79          20 
$x_5$       Histology 4: 1-CLR, 0-Other            28          14 
            (Histology baseline is UNCL)           109         74 
$x_6$       Grade: 1-MD or WD, 0-PD                160, 544    37, 304 
$x_7$       Stage 1: 1-Stage I, 0-Other            117         19 
$x_8$       Stage 2: 1-Stage II, 0-Other           217         78 
$x_9$       Stage 3: 1-Stage III, 0-Other          275         172 
            (Stage baseline is IV)                 95          72 
$x_{10}$    Surgery: 1-Complete, 0-Inc.            396, 308    128, 213 
$x_{11}$    Residuum 1: 1-Small, 0-Other           110         54 
$x_{12}$    Residuum 2: 1-Large, 0-Other           296         198 
            (Residuum baseline is none)            298         89 


The analysis which follows was carried out using the package ISMOD 
described in Section 4, along with the Integrated Software Systems Corp. 
graphics package DISSPLA. The data set is complex, and the analysis is 
meant to illustrate various points rather than to give an exhaustive discussion 
of its features. 

We note to start that medical background provides a lot of information 
about the qualitative nature of the relationships between covariates and survival 
time, and among the covariates. For example, surgery, residuum and 
stage covariates are related, since the location and extent of the disease at 
diagnosis affect all three. It is also clear that the prognosis tends to be 
poorer for patients in higher stages, for patients with large residuum rather 
than none or small, and so on. In preliminary examination of the data we 
generated various plots and tables such as Tables 2 and 3, which show summary 
data for cross-classifications of explanatory variables. Tables 2 and 3 
give, for each pair of categories in the table, the total number of observations 
($n$), the number of uncensored observations ($n_u$), and an estimated mean 
survival time $\bar{y}$ computed as the total of survival and censoring times divided 
by $n_u$. This estimate, which is the MLE of mean survival time when times 
follow an exponential distribution, provides a crude picture of the effects of 
covariates. It should be noted that the estimate may be very unreliable when 
$n_u$ is small, and when censoring is very heavy. Note also that in considering 
the possibility of covariate effects under the models described in this paper, 
ratios of these means, rather than differences, are of interest. 
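
The crude mean used in Tables 2 and 3 is simple enough to state as code. The following is a minimal sketch; the function name `exponential_mean` and the toy data are illustrative assumptions, not part of ISMOD.

```python
def exponential_mean(times, events):
    """Exponential-model MLE of mean survival time.

    times  : survival or censoring time for each individual
    events : 1 if the time is uncensored (a death), 0 if censored
    Returns total time divided by the number of uncensored
    observations, or None when there are no uncensored observations
    (the "undef." cells of the summary tables).
    """
    n_u = sum(events)
    if n_u == 0:
        return None
    return sum(times) / n_u

# Toy data for two groups (made up for illustration only).
group_a = exponential_mean([10.0, 20.0, 30.0], [1, 0, 1])
group_b = exponential_mean([5.0, 5.0, 10.0], [1, 1, 0])
ratio = group_a / group_b   # ratios, not differences, are compared
```

Censored times still contribute to the numerator, which is why the estimate can be far larger than any observed death time when censoring is heavy.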

Plots of the data are also useful. Figure 1, for example, shows product-limit 
survivor function estimates $\hat{S}(y)$ for individuals in each of Stages I to 
IV (unadjusted for any other covariates). Widely varying survival experience 
in the different stages is apparent. 
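
For readers who want to reproduce curves of this kind, the product-limit (Kaplan-Meier) estimate can be sketched in a few lines. The function name `product_limit` and the toy data are assumptions for illustration; tied times are handled by the usual convention that individuals censored at time $t$ are still at risk at $t$.

```python
def product_limit(times, events):
    """Product-limit estimate of the survivor function.

    Returns a list of (time, S_hat) pairs, one per distinct death
    time, where S_hat is the estimate just after that time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose recorded time equals t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve

# Toy data: deaths at 1, 3 and 4, one censored time at 2.
curve = product_limit([1, 2, 3, 4], [1, 0, 1, 1])
```

Applying the function separately to the individuals in each stage gives the four unadjusted curves that a plot such as Figure 1 compares.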

As a first pass at model fitting we consider AFT and PH models with 
covariates $x_1, \ldots, x_{12}$ present. The Weibull, Cox PH and log logistic models 
turn out to give essentially the same picture regarding covariate effects, 
although they do not fit the data equally well. Table 4 shows MLE’s and 
$z$ values (= MLE ÷ estimated standard deviation) under the three models. 
It should be remarked that the meaning of a regression coefficient $\beta_j$ is not 
exactly the same under the three models. For the Cox PH and Weibull 
models, the $\beta_j$’s have the same interpretation within the PH family, and 
direct comparison of values is relevant. Within the AFT family, values of 
$\beta_j/p$ should be compared under the Weibull and log logistic models. 

From an examination of the fitted models and the data, it is clear that 
the stage covariates and the grade covariate are very important, with age, 
surgery, and large residuum also having some effect. Although the different 
models agree broadly as to the importance of the covariates, they do not fit 
the data equally well: the log logistic fits somewhat better than the Weibull 
and there is a clear indication that the Weibull model, at least with the 
given set of covariates, is unable to account for all of the heterogeneity in 
the data. For example, Figure 2 shows a residual probability plot for the 
Table 2. Summary Data for Stage x Surgery Categories 

                                              Stage 
Surgery       I                    II                  III                 IV                  Total 
----------    -----------------    ----------------    ----------------    ----------------    ---------------- 
Complete      $n$ = 111            $n$ = 173           $n$ = 93            $n$ = 19            $n$ = 396 
              $n_u$ = 19           $n_u$ = 48          $n_u$ = 47          $n_u$ = 14          $n_u$ = 128 
              $\bar{y}$ = 7219     $\bar{y}$ = 4852    $\bar{y}$ = 1515    $\bar{y}$ = 2326    $\bar{y}$ = 3701 

Incomplete    $n$ = 6              $n$ = 44            $n$ = 182           $n$ = 76            $n$ = 308 
              $n_u$ = 0            $n_u$ = 30          $n_u$ = 125         $n_u$ = 58          $n_u$ = 213 
              $\bar{y}$ = undef.   $\bar{y}$ = 1445    $\bar{y}$ = 790     $\bar{y}$ = 200     $\bar{y}$ = 839 

Total         $n$ = 117            $n$ = 217           $n$ = 275           $n$ = 95 
              $n_u$ = 19           $n_u$ = 78          $n_u$ = 172         $n_u$ = 72 
              $\bar{y}$ = 7833     $\bar{y}$ = 3542    $\bar{y}$ = 988     $\bar{y}$ = 260 


Weibull model with all 12 covariates present. The plot is not close to linear 
and in fact consists very roughly of two segments, which is an indication 
that there is heterogeneity related to longer and shorter term survivors that 
is not accounted for by the Weibull model. Other evidence is also available: 
for example, within the Burr family (2.6) the maximized log likelihoods for 
the Weibull and log logistic models are -692.5 and -679.2, respectively, 
indicating strong support for the log logistic over the Weibull. In addition, 
plots of the nonparametric estimate $\hat{S}_0(y)$ (see (2.1)) obtained from the Cox 
PH analysis indicate a shape not consistent with the Weibull model and 
there is a suggestion of a non-monotonic hazard function. The log logistic 
model fits the data somewhat better, though systematic departures from it 
are also apparent. 

At this point, the evidence suggests that stage and grade are particularly 
important covariates, that several others are of some importance, and that, after 
adjusting for main effects of these covariates, there is an indication of a 
non-monotonic hazard function that first increases and then decreases. To look 
for further explanations of the wide variability in survival times, we decided 
next to examine the data separately within each stage. There are several 
reasons why this is a good idea, including the possibility of stage by other 
Table 3. Summary Data for Stage x Grade Categories 

                                                Stage 
Grade            I                     II                  III                 IV                 Total 
-------------    ------------------    ----------------    ----------------    ---------------    ---------------- 
Moderately       $n$ = 48              $n$ = 65            $n$ = 38            $n$ = 9            $n$ = 160 
or Well Diff.    $n_u$ = 2             $n_u$ = 14          $n_u$ = 13          $n_u$ = 8          $n_u$ = 37 
                 $\bar{y}$ = 35,664    $\bar{y}$ = 7129    $\bar{y}$ = 3038    $\bar{y}$ = 479    $\bar{y}$ = 5796 

Poorly Diff.     $n$ = 69              $n$ = 152           $n$ = 237           $n$ = 86           $n$ = 544 
                 $n_u$ = 17            $n_u$ = 64          $n_u$ = 159         $n_u$ = 64         $n_u$ = 304 
                 $\bar{y}$ = 4558      $\bar{y}$ = 2757    $\bar{y}$ = 820     $\bar{y}$ = 449    $\bar{y}$ = 1359 

Total            $n$ = 117             $n$ = 217           $n$ = 275           $n$ = 95 
                 $n_u$ = 19            $n_u$ = 78          $n_u$ = 172         $n_u$ = 72 
                 $\bar{y}$ = 7833      $\bar{y}$ = 3542    $\bar{y}$ = 988     $\bar{y}$ = 260 



covariate interactions, the fact that the data are highly unbalanced with 
regard to certain combinations of stage and other covariates, and the fact 
that the data set is very heavily censored and that the degree of censoring 
varies widely across stages. It is also clear that for stages I and II and 
to a lesser extent stage III there are substantial proportions of long term 
survivors; see Figure 1. With the heavy censoring in these stages it is difficult 
to say whether this can be explained completely in terms of the covariates 
under study, but it does not appear that it can. The picture here is rather 
different than for stage IV where there are many fewer long term survivors. 

There was doubt that Weibull or log logistic models could adequately 
fit the data for stages I to III, because of the appearance that 
there are individuals whose long term survival is not fully explainable by 
these models and the covariates that are present. Nevertheless we fitted 
these models in order to get an idea of the importance of covariates within 
stages. This prompts the following remarks. 

1. The importance of the other covariates depends to a large extent on the 
stage. Table 5 shows the factors that appear important, for each stage. 
We find, for example, that the effect of age is more pronounced in stages 
I and IV than in II and III. Note also that Grade has an important effect 
in all stages except IV, where there are however only a very few patients 
who do not have poorly differentiated tumors. 

Figure 1. Empirical survivor functions (product-limit estimates), stages I 
to IV. 

2. For stage IV the parametric models fit pretty well. For stages I to III, 
however, the log logistic and Weibull models both overestimate the number 
of short to medium term survivors and underestimate the number of 
long term survivors, as one might expect from the discussion in 1. This 
is less pronounced in stage III than in I and II. Figures 3 and 4 show 
raw product-limit estimates $\hat{S}(y)$ computed from the lifetimes in stages 
II and III with no covariate adjustment, along with estimated survivor 
functions 

    $\tilde{S}(y) = \frac{1}{n_s} \sum_i \left[ 1 + e^{x_i'\hat\beta} y^{\hat{p}} \right]^{-1}$ 

computed under the log logistic model with the covariates indicated on 
the graph. Here, the sum is over the individuals in the stage and $n_s$ is 
Table 4. MLE’s $\hat\beta_j$ Under Three Regression Models, for Full Data Set 

       Covariate       Cox PH            Weibull           Log logistic 
---    ------------    --------------    --------------    -------------- 
 1.    Age             .022 (3.8)*       .024 (4.1)        .026 (3.3) 
 2.    Histology 1     -.073 (-.5)       -.10 (-.7)        -.23 (-1.1) 
 3.    Histology 2     -.042 (-.2)       -.11 (-.6)        -.078 (-.3) 
 4.    Histology 3     -.27 (-1.0)       -.26 (-1.0)       -.23 (-.6) 
 5.    Histology 4     .49 (1.6)         .55 (1.8)         .56 (1.3) 
 6.    Grade           -.70 (-3.7)       -.83 (-4.4)       -1.12 (-4.6) 
 7.    Stage 1         -2.04 (-6.5)      -2.21 (-7.1)      -2.99 (-7.6) 
 8.    Stage 2         -1.54 (-7.5)      -1.72 (-8.3)      -2.39 (-8.5) 
 9.    Stage 3         -.74 (-4.8)       -.74 (-4.8)       -1.19 (-5.3) 
10.    Surgery         -.50 (-2.8)       -.50 (-2.8)       -.62 (-2.5) 
11.    Residuum 1      .29 (1.4)         .26 (1.3)         .41 (1.6) 
12.    Residuum 2      .29 (1.6)         .30 (1.7)         .59 (2.4) 

       MLE $\hat{p}$ (s.e.)              1.23 (.052)       1.67 (.075) 

*$z$ values are in brackets. 


Table 5. Important Factors Within Stages 
Stage 


I II III IV 


Age Grade Histology MUC Age 

Grade Surgery Grade Surgery 

Residuum Residuum 


the number of individuals in the stage. These plots portray the over- 
and underestimation mentioned above. 

3. It is of interest to fit models that explicitly recognize the possibility 
of long term survivors. This can be done, though the heavy censoring 
present makes clear assessments difficult. 
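
The averaged model-based survivor function described in remark 2 can be sketched as follows. This is a hedged illustration: it assumes the log logistic survivor function for an individual with covariate vector $x$ has the form $[1 + e^{x'\beta} y^{p}]^{-1}$, a parametrization chosen here to be consistent with the signs in Table 4 rather than one confirmed by the model definitions of Section 2. The function names and numerical values are hypothetical.

```python
import math

def loglogistic_survivor(y, x, beta, p):
    """Assumed form: S(y | x) = 1 / (1 + exp(x'beta) * y**p)."""
    linear = sum(b * xj for b, xj in zip(beta, x))
    return 1.0 / (1.0 + math.exp(linear) * y ** p)

def averaged_survivor(y, covariates, beta, p):
    """Average the fitted survivor function over the individuals in a stage."""
    total = sum(loglogistic_survivor(y, x, beta, p) for x in covariates)
    return total / len(covariates)

# Hypothetical fitted values and two covariate patterns, for illustration.
beta_hat = [-1.1]
p_hat = 1.67
stage_covariates = [[0.0], [1.0]]
s_at_2 = averaged_survivor(2.0, stage_covariates, beta_hat, p_hat)
```

Plotting `averaged_survivor` over a grid of $y$ values against the product-limit curve for the same stage gives the kind of comparison shown in Figures 3 and 4.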

The data were also examined with regard to influential observations and 
in a number of other ways. In the interest of economy and brevity we will 
only remark that influence analysis did not yield any really striking results. 
The size of the data set and the heavy right censoring no doubt have something 
to do with this. Of course, progressively more refined models can also be 
fitted to the data, but we will not discuss this here. 

Figure 2. Residual probability plot, Weibull model. 


6. CONCLUDING REMARKS 

Regression methodology provides useful tools for investigating large 
medical data bases. It is our experience that with software of the type 
discussed in Section 4 it is feasible to explore the data and to fit and assess 
regression models with moderate amounts of computer time. Models such 
as the Weibull, log logistic and Cox PH models discussed in this paper are 
easily handled and provide between them sufficient flexibility to determine 
many features of the data. Although in large data bases they invariably 
will not give a completely satisfactory description of the data, they provide 
convenient and useful baselines against which to look for other interesting 
features. 

Figure 3. Empirical (product-limit) and estimated survivor functions, stage 
II. 

We also find categorical response regression models very useful, and will 
conclude with a few remarks about them. The models we have so far found 
most convenient are two types that handle ordered categorical responses. If 
$Y$ is a categorical variable which can take on values $1, 2, \ldots, c$ with probabilities 
$P_i(x)$, $i = 1, \ldots, c$, where $\sum_{i=1}^{c} P_i(x) = 1$ and $x$ is a vector of covariates, 
then the two types can be represented as follows: 

(a) Following McCullagh (1980) and others, we take 

    $\gamma_j(x) = \Pr(Y \le j \mid x)$ 

to be of the form $F(\alpha_j + x'\beta)$, where $F$ is a continuous distribution 
function on $(-\infty, \infty)$ and $-\infty < \alpha_1 < \ldots < \alpha_c = \infty$. Two flexible 
models are the ones for which 

    $F(u) = e^u/(1 + e^u)$ and $F(u) = 1 - \exp(-e^u)$, 

respectively. 

Figure 4. Empirical (product-limit) and estimated survivor functions, stage 
III. 

(b) In this case regression models are specified in a form that makes them 
convenient for handling grouped and possibly censored lifetime data 
(e.g., Lawless, 1982, section 7.3). These take 

    $\lambda_j(x) = \Pr(Y = j \mid Y \ge j, x)$ 

to be of the form $F(\alpha_j + x'\beta)$, where $F$ and the $\alpha_j$’s are as in (a). 

Between them the models in (a) and (b) provide good flexibility for exploring 
categorical data. Regarding software, we note that the package ISMOD 
discussed in Section 4 also handles these methods, providing the same sorts 
of features described and illustrated above for the continuous survival time 
models. 
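
To make the two model types concrete, the sketch below turns cut points and a linear predictor into category probabilities under each type. It is illustrative only: the function names are hypothetical, `alphas` holds the finite cut points $\alpha_1 < \ldots < \alpha_{c-1}$ (with $\alpha_c = \infty$ handled implicitly), and `xb` stands for $x'\beta$.

```python
import math

def logistic(u):
    """F(u) = e^u / (1 + e^u)."""
    return 1.0 / (1.0 + math.exp(-u))

def extreme_value(u):
    """F(u) = 1 - exp(-e^u)."""
    return 1.0 - math.exp(-math.exp(u))

def cumulative_model_probs(alphas, xb, F):
    """Type (a): Pr(Y <= j | x) = F(alpha_j + xb); returns P_1, ..., P_c."""
    cumulative = [F(a + xb) for a in alphas] + [1.0]
    probs, previous = [], 0.0
    for g in cumulative:
        probs.append(g - previous)
        previous = g
    return probs

def conditional_model_probs(alphas, xb, F):
    """Type (b): Pr(Y = j | Y >= j, x) = F(alpha_j + xb); returns P_1, ..., P_c."""
    probs, still_at_risk = [], 1.0
    for a in alphas:
        h = F(a + xb)
        probs.append(still_at_risk * h)
        still_at_risk *= 1.0 - h
    probs.append(still_at_risk)   # last category takes the remainder
    return probs

cuts = [-1.0, 0.0, 1.5]           # illustrative cut points, c = 4
probs_a = cumulative_model_probs(cuts, 0.3, logistic)
probs_b = conditional_model_probs(cuts, 0.3, extreme_value)
```

In type (b) the conditional probability plays the role of a discrete hazard, which is why that form suits grouped lifetimes: a censored individual simply stops contributing factors after the last interval in which it was observed.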
RECURSIVE PARTITION: 

A VERSATILE METHOD FOR EXPLORATORY 
DATA ANALYSIS IN BIOSTATISTICS 

ABSTRACT 

In clinical and epidemiological studies, interactions among covariates are 
of primary importance. In the absence of clearly stated a priori hypotheses, 
it is usual to build predictive models by means of stepwise logistic or Cox 
regression, a practice that tends to emphasize main effects and overlook 
possible interactions. 

An alternative approach is that of Recursive Partition (RP), a well 
known clustering method adapted by us to the case of response variables 
frequently encountered in Biostatistics (e.g., censored survival times and 
discrete response). 

We will discuss the application of RP to the determination of prognostic 
strata, propensity strata, and treatment response strata. 

In each case we look for a finite number of subgroups of a large patient 
population, homogeneous with respect to prognosis, probability of being 
assigned to a certain treatment, and response to treatment, respectively. 

We will also tackle the general problem of evaluating the statistical 
soundness of exploratory analyses. The use of cross-validation, bootstrapping 
and model selection criteria, such as the Akaike Information Criterion, 
will be discussed. 
1. INTRODUCTION 

In clinical and epidemiological studies it is often useful to summarize 
complex information by identifying and describing homogeneous strata, i.e., 
subpopulations of a given population which are defined by values of certain 
clinical variables and are homogeneous in a specified sense. We will discuss 
here three examples of particular importance, although many more situations 
could be cast in a similar framework. 

a) Prognostic Stratification 

One attempts to identify subpopulations of homogeneous prognosis. 

b) Propensity Stratification 

The aim is to find subpopulations within which the probability of being 
assigned to a certain treatment is homogeneous. This is a useful preliminary 
step before testing treatment effect (Rosenbaum and Rubin, 1984). 

c) Treatment Response Stratification 

Subpopulations are sought within which the treatment-response relationship 
is homogeneous, with the aim of separating cases for whom the 
treatment is effective from those for whom it is useless or harmful (Gail and 
Simon, 1985). 

The problem of finding an appropriate stratification is usually solved by 
ad hoc adaptations of regression techniques. For example, Byar (1982) used 
Cox and Weibull regressions to define a score $s = \hat\beta \cdot z$, where $\hat\beta$ is the 
estimated regression parameter vector and $z$ is a covariate vector; the strata are 
then defined by splitting the score into appropriate intervals. A similar 
procedure could be applied to the propensity score in order to obtain propensity 
strata; in this case the score is constructed from a logistic regression model 
for the probability of receiving a certain treatment (Rosenbaum and Rubin, 
1984). 

These procedures have two main shortcomings: they lead to descriptions 
which are not always clinically interpretable, and they are based on 
regression models which, for practical reasons, are often built without proper 
consideration for interactions among clinical variables. 

This work is devoted to an alternative approach which avoids the above 
shortcomings and has the additional advantage of providing a general and 
flexible framework for the problem of stratification. It is based on the idea 
of “growing trees”, i.e., of successively dividing the population into 
subpopulations of increasing homogeneity until one obtains populations for which 
the degree of homogeneity cannot be increased by further splitting. An 
essential feature of the approach is that splits are induced by statements on 
certain “predictor” variables so that, by definition, the strata are described 
in direct clinical terms. An algorithm based on this approach is termed 
Recursive Partition (RP) (Hartigan, 1975, Ch. 18). An early example of RP is 
the AID algorithm and its generalizations (Sonquist et al., 1973; Gillo and 
Shelly, 1974). 

More recently, the monograph by Breiman et al. (1984) has introduced a 
number of ground-breaking new ideas and has described a computer program 
named CART (Classification and Regression Trees). 

The present work develops the tree-growing approach for the situations 
prevalent in Biostatistics, where one often encounters non-standard, 
non-normal regression models. 

2. DEFINITIONS AND NOTATION 

We consider data in the form $(u^{(i)}, z^{(i)})$, $i = 1, \ldots, N$, where $i$ labels 
the individual and $u$, $z$ are (row) vectors of variables to be further described 
below. 

We shall refer to the components of u as criterion variables and to those 
of z as predictor variables. Predictors contain background information used 
to define strata which are homogeneous according to a criterion based on 
the u variables; thus, for each homogeneous stratum one can define a unique 
criterion quantity independent of the z variables given the stratum. 

For prognostic stratification the criterion variables are of the form: 

    $u^{(i)} = (t^{(i)}, \delta^{(i)})$,  $i = 1, \ldots, N$, 

where $t$ is the observed survival time and $\delta$ is the censorship indicator, with 
$\delta = 0$ if $t$ is censored and $\delta = 1$ if it is an observed failure time. (We recall 
that a survival time is said to be (right) censored if it is a lower bound to 
the actual failure time.) The criterion quantity is the survival curve. The $z$ 
variables are known as prognostic factors. 

For propensity stratification we have a single criterion variable $u$, a 
categorical variable indicating treatment group. The criterion quantity is the 
vector of probabilities of being assigned to each treatment group. 

For treatment response stratification, the criterion variables are of the 
form 

    $u^{(i)} = (y^{(i)}, x^{(i)})$,  $i = 1, \ldots, N$, 

where $y$ is the response variable (generally of the form $(t, \delta)$ as above), 
and $x$ is a vector variable defining treatment. The criterion quantity is 
the appropriate regression equation which represents the treatment-response 
relationship. 
We will assume that predictors are categorical variables, i.e., $z = 
(z_1, z_2, \ldots, z_k)$ and for each $j$, $z_j$ has a finite number of levels $\ell_1, \ell_2, \ldots, \ell_{m_j}$. 
A categorical variable $z_j$ is ordered if there is a natural order relationship 
among its levels, e.g., $\ell_1 < \ell_2 < \cdots < \ell_{m_j}$, and unordered otherwise. 

Our aim is to build a binary tree with nodes representing subpopulations. 
In particular the root of the tree represents the whole population and the 
terminal nodes represent the homogeneous strata (see Figure 1a). Any node 
which is not terminal is split into two branches by a statement about $z$; by 
convention, the node issuing from the left branch represents the subpopulation 
satisfying the statement and the one issuing from the right branch 
represents the subpopulation not satisfying it (Figure 1b). 


A typical Recursive Partition Tree 



There will usually be some restriction on the statements defining splits. 
We shall consider two classes of Split Defining Statements (SDS). 

The class of simple SDS, $S_S$ 

The class consists of statements of the form $z_j \in A_j$, where $z_j$ is a 
component of $z$ and $A_j$ is a subset of the $m_j$ levels of $z_j$. For $z_j$ unordered, 
$A_j$ can be any of the $2^{m_j-1} - 1$ nontrivial subsets of $\{\ell_1, \ell_2, \ldots, \ell_{m_j}\}$ 
(a subset and its complement induce the same split and are counted once). For $z_j$ 
ordered, $A_j$ can be any of the $m_j - 1$ subsets of the form $A_j = [\ell_1, \ell]$, 
$\ell = \ell_1, \ldots, \ell_{m_j-1}$. 

The class of Boolean Combinations, $S_B$ 

This class includes statements of the form ‘$z_{j_1} \in A_{j_1}$ and $z_{j_2} \in A_{j_2}$ and 
... and $z_{j_r} \in A_{j_r}$’, $r = 1, \ldots, k$. 
We shall use the following abbreviations: 

$[z_j \in A_j]$ for {all individuals $i$ | $z_j^{(i)} \in A_j$}, 

$[z_j \le \ell]$ for $[z_j \in [\ell_1, \ell]]$ = {all individuals $i$ | $z_j^{(i)} \le \ell$}, 

$[z_{j_1} \in A_{j_1}, z_{j_2} \in A_{j_2}, \ldots, z_{j_r} \in A_{j_r}]$ for $\bigcap_{s=1}^{r} [z_{j_s} \in A_{j_s}]$, 

etc. 
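
The size of the simple SDS class is easy to check by enumeration. The sketch below lists the candidate sets $A_j$ for one predictor; the function names are hypothetical. For an unordered predictor it counts a subset and its complement once, since both induce the same split, which gives $2^{m-1} - 1$ distinct splits for $m$ levels.

```python
from itertools import combinations

def ordered_splits(levels):
    """Sets of the form [l_1, l] for an ordered predictor: m - 1 of them."""
    return [levels[:k] for k in range(1, len(levels))]

def unordered_splits(levels):
    """Distinct two-way splits of an unordered predictor.

    Every candidate set contains the first level, so a set and its
    complement are never both listed; the full set is excluded.
    """
    first, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest)):            # r extra levels joined to `first`
        for extra in combinations(rest, r):
            splits.append([first, *extra])
    return splits

stage_splits = ordered_splits(["I", "II", "III", "IV"])
histology_splits = unordered_splits(["SER", "END", "MUC", "CLR", "UNCL"])
```

The exponential growth in the unordered case is one reason the search over splits dominates the cost of tree construction, and it grows further when Boolean combinations of several predictors are allowed.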


3. DISSIMILARITY MEASURES AND HOMOGENEITY CRITERIA 

In an RP tree each non-terminal node is split by a statement of the 
SDS class into two nodes which represent subpopulations as dissimilar as 
possible from the point of view of the criterion quantity. A terminal node 
represents a population which cannot be split any further and is therefore 
to be considered homogeneous. Thus, in order to specify an RP algorithm 
we need a dissimilarity measure between two disjoint populations and a 
homogeneity criterion to decide whether a node should be split further. 

The likelihood ratio statistic (LRS) provides a natural measure of 
dissimilarity as follows. Let $P_1$, $P_2$ be disjoint populations and let $P$ denote 
their union. We shall assume that the criterion quantity is represented by a 
parameter $\theta$ which may take different values $\theta_1$, $\theta_2$ for $P_1$ and $P_2$ and that 
likelihood functions $L_1(\theta_1)$, $L_2(\theta_2)$ can be defined for $P_1$, $P_2$. We shall also 
assume that the likelihood function for $P$ is of the form: 

    $L(\theta_1, \theta_2) = L_1(\theta_1) L_2(\theta_2)$. 

Consider now the hypothesis 

    $H_0: \theta_1 = \theta_2 = \theta$    (1) 

and the alternative 

    $H: \theta_1 \ne \theta_2$. 

Then the LRS of $H$ versus $H_0$ is defined as: 

    $\rho(H \mid H_0) = 2 \log \{ [L_1(\hat\theta_1) L_2(\hat\theta_2)] / [L_1(\hat\theta) L_2(\hat\theta)] \}$,    (2) 

where $\hat\theta_1$, $\hat\theta_2$ are the maximum likelihood estimates (mle’s) of $\theta_1$ and $\theta_2$ 
under $H$ and $\hat\theta$ is the mle of $\theta$ under $H_0$. Clearly, the larger $\rho(H \mid H_0)$ is, 
the greater is the evidence in the data that $P_1$ and $P_2$ are inhomogeneous 
with respect to the criterion quantity. It therefore provides a reasonable and 
general measure of dissimilarity: 

    $d(P_1, P_2) = \rho(H \mid H_0)$.    (3) 
It is easy to generalize the definition of dissimilarity between two disjoint 
populations to an arbitrary number of disjoint populations $P_1, P_2, \ldots, P_n$. 
The LRS of the hypothesis $H$: the $\theta_i$ are not all the same, versus $H_0$: all $\theta_i$ are 
equal, can be written 

    $\rho(H \mid H_0) = 2 \log \left\{ \prod_{i=1}^{n} L_i(\hat\theta_i) \Big/ \prod_{i=1}^{n} L_i(\hat\theta) \right\}$    (4) 

and provides a measure of global dissimilarity among the $n$ populations. This 
expression is useful in evaluating tree structured predictors and will be taken 
up again in Section 5. 

The LRS also provides a natural tool to define homogeneity. We recall 
that under regularity conditions $\rho(H \mid H_0)$ is asymptotically $\chi^2_{(k)}$ under $H_0$, 
where $k = \dim(\theta)$ for the case of two populations [for $n$ populations 
$k = (n-1)\dim(\theta)$]. 
Thus, it would seem reasonable to consider a population homogeneous if for 
every possible split $\{P_1, P_2\}$, $d(P_1, P_2)$ is not statistically significant at some 
preset significance level $\alpha$. On the other hand, we also wish to avoid splits 
into subpopulations which are too small, if for no other reason than that the 
$\chi^2$ approximation for the LRS would not be valid. These considerations lead 
to the following: 


Definition 

A split $\{P_1, P_2\}$ is admissible if, for preset $\alpha \in [0,1]$ and $m$ an integer: 

a) $P_1$ and $P_2$ each have at least $m$ individuals (in the case of censored survival 
data we may require that in each population there be at least $m$ observed 
failures); 

b) $d(P_1, P_2)$ is significant at the $\alpha$ significance level. 

A population is homogeneous if no split generated by the SDS class is 
admissible. 

We shall now discuss the choice of dissimilarity measures for the three 
stratification problems studied in this work. 

In the case of prognostic stratification we may assume that the time to 
failure is exponentially distributed, i.e., the survival function for $P_\ell$, $\ell = 1, 2$, 
is: 

    $S_\ell(t) = \exp\{-\lambda_\ell t\}$ 

or, equivalently, the density function is: 

    $f_\ell(t) = \lambda_\ell \exp\{-\lambda_\ell t\}$, 

where $\lambda_\ell$ is the parameter of interest. Then, as is well known (Lawless, 
1982), we have, under the hypothesis of random censoring: 

    $\log L_\ell(\lambda_\ell) = \sum_{i \in P_\ell} \left[ \delta^{(i)} \log f_\ell(t^{(i)}) + (1 - \delta^{(i)}) \log S_\ell(t^{(i)}) \right]$. 
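
Under the exponential model the dissimilarity has a closed form, because the log likelihood reduces to $d \log \lambda - \lambda T$, with $d$ the number of observed failures and $T$ the total time, and is maximized at $\hat\lambda = d/T$. The following is a minimal sketch; the function names and data are illustrative assumptions.

```python
import math

def exp_max_loglik(times, events):
    """Maximized censored-exponential log likelihood, d*log(d/T) - d.

    With no observed failures the maximum is attained at rate 0,
    where the log likelihood equals 0.
    """
    d, total = sum(events), sum(times)
    if d == 0:
        return 0.0
    return d * math.log(d / total) - d

def exponential_lrs(times1, events1, times2, events2):
    """LRS of H (different rates) versus H0 (common rate)."""
    separate = exp_max_loglik(times1, events1) + exp_max_loglik(times2, events2)
    pooled = exp_max_loglik(times1 + times2, events1 + events2)
    return 2.0 * (separate - pooled)

# Two small made-up populations with very different failure rates.
lrs = exponential_lrs([1.0, 1.0], [1, 1], [10.0, 10.0], [1, 1])
```

Only the pair $(d, T)$ is needed for each candidate subpopulation, so the statistic is cheap to recompute for every split, which is the practical attraction of the exponential assumption.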
More realistic (and more complex) models (Weibull, gamma, etc.) can be 
used, but computations become very costly. Alternatively, one can define a 
LRS based on Cox pseudo-likelihood (Cox, 1972), as follows. Let us assume 
that the hazard function $\lambda(t) = f(t)/S(t)$ of one population is proportional 
to that of the other. Let $\gamma$ be an indicator variable which takes value $\ell$ for 
an individual in $P_\ell$, $\ell = 1, 2$. Equivalently, the proportional hazards (PH) 
assumption can be written: 

    $\lambda(t; \gamma) = e^{\beta\gamma} \lambda_0(t)$, 

where $\lambda_0(t)$ is an unspecified baseline hazard function. Then the LRS can 
be defined as: 

    $\rho(H \mid H_0) = 2 \log \{ L^{Cox}(\hat\beta) / L^{Cox}(0) \}$. 

(For a definition of $L^{Cox}$ see Lawless, 1982, pp. 344-372.) 

It should be noted that this approach rests on the PH assumption; however, 
this seems to be weaker than the assumption of exponential failure 
times. 

In the case of propensity stratification the appropriate likelihood for $P_1$ 
and $P_2$ is the multinomial likelihood and the parameter of interest is $p = 
(p_1, p_2, \ldots, p_s)$, the vector whose $j$th component is the probability of being 
assigned to the $j$th treatment. 

We have: 

    $L_\ell(p) = \prod_{j=1}^{s} p_j^{n_{j\ell}}$,  $\ell = 1, 2$,    (5) 

where $n_{j\ell}$ is the number of individuals in $P_\ell$ assigned to the $j$th treatment. 
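
For the multinomial case the maximized likelihoods are again available in closed form, since the mle of $p_j$ within a population is the observed proportion assigned to treatment $j$. The sketch below computes the resulting LRS from two vectors of treatment counts; the function names and counts are illustrative assumptions.

```python
import math

def multinomial_max_loglik(counts):
    """Maximized log of prod_j p_j**n_j, attained at p_j = n_j / n.

    Empty categories contribute nothing (0 * log 0 is taken as 0).
    """
    n = sum(counts)
    return sum(c * math.log(c / n) for c in counts if c > 0)

def propensity_lrs(counts1, counts2):
    """LRS of H (different assignment probabilities) versus H0 (common p)."""
    pooled = [a + b for a, b in zip(counts1, counts2)]
    separate = multinomial_max_loglik(counts1) + multinomial_max_loglik(counts2)
    return 2.0 * (separate - multinomial_max_loglik(pooled))

# Made-up counts of individuals assigned to each of two treatments.
balanced = propensity_lrs([10, 10], [10, 10])
separated = propensity_lrs([10, 0], [0, 10])
```

With $s$ treatments the vector $p$ has $s - 1$ free components, so for two populations the statistic would be referred to a $\chi^2$ distribution with $s - 1$ degrees of freedom.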

In the case of treatment response stratification, we may assume that the 
treatment variable $x$ influences response for $P_\ell$ by a PH model: 

    $\lambda_\ell(t; x) = e^{\beta_\ell \cdot x} \lambda_{0\ell}(t)$, 

where $\beta_\ell$ is the vector of regression coefficients for the $\ell$th population. 

Then 

    $L_\ell(\beta_\ell) = L_\ell^{Cox}(\beta_\ell)$,  $\ell = 1, 2$,    (6) 

and 

    $\rho(H \mid H_0) = 2 \log \{ [L_1^{Cox}(\hat\beta_1) L_2^{Cox}(\hat\beta_2)] / [L_1^{Cox}(\hat\beta) L_2^{Cox}(\hat\beta)] \}$. 
4. RECURSIVE PARTITION AND AMALGAMATION ALGORITHMS 

In order to define an RP algorithm one needs to specify three objects: 
(a) an SDS class $S$, (b) a dissimilarity measure $d(P_1, P_2)$ defined for any 
pair of disjoint subpopulations, (c) a criterion of homogeneity $C$ (or 
$C_\alpha$ when we wish to emphasize its dependence on the chosen significance 
level $\alpha$). Then an RP algorithm is defined as follows: 

RP Algorithm 

Choose $S$, $d$, $C_\alpha$. 

Denote by $N = \{N_1, N_2, \ldots, N_r\}$ the current collection of nodes. 

(i) To initialize, set $r = 1$ and let $N_1$ represent the whole population. 

(ii) For every $N_j \in N$ and every split $\{P_{1j}, P_{2j}\}$ defined by an 
element of $S$, compute $d(P_{1j}, P_{2j})$ and check admissibility. 

(iii) Using the results of (ii), check $C_\alpha$ for each node. 

(iv) If all nodes of $N$ pass the homogeneity criterion, declare the nodes of 
the current collection terminal and STOP. ELSE: 

(v) Among all nodes which do not pass $C_\alpha$, choose the node $N_{j^*}$ 
corresponding to the split $\{P_{1j^*}, P_{2j^*}\}$ with largest dissimilarity 
and replace $N_{j^*}$ by two nodes representing $P_{1j^*}$ and $P_{2j^*}$. Use 
the resulting collection of nodes as current and go to (ii). 
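Steps (i) to (v) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the SDS class is represented as a list of yes/no questions, the criterion $C_\alpha$ is reduced to comparing the dissimilarity with a fixed critical value, admissibility is reduced to a minimum node size, and the toy dissimilarity `mean_gap` and the toy data are hypothetical.

```python
from statistics import fmean


def best_split(members, splits, d, min_size):
    """Most dissimilar admissible split of one node, as (value, left, right), or None."""
    best = None
    for question in splits:
        left = [m for m in members if question(m)]
        right = [m for m in members if not question(m)]
        if len(left) < min_size or len(right) < min_size:
            continue  # inadmissible: one subpopulation is too small
        value = d(left, right)
        if best is None or value > best[0]:
            best = (value, left, right)
    return best


def recursive_partition(population, splits, d, critical_value, min_size):
    """Return the terminal nodes (lists of individuals) of the RP tree."""
    nodes = [list(population)]                      # step (i)
    while True:
        failing = []
        for index, members in enumerate(nodes):     # step (ii)
            found = best_split(members, splits, d, min_size)
            # step (iii): a node fails the criterion if its best split is "significant"
            if found is not None and found[0] > critical_value:
                failing.append((found[0], index, found[1], found[2]))
        if not failing:                             # step (iv)
            return nodes
        # step (v): split the failing node whose best split is most dissimilar
        _, index, left, right = max(failing, key=lambda item: item[0])
        nodes[index:index + 1] = [left, right]


def mean_gap(left, right):
    """Hypothetical toy dissimilarity: absolute difference of mean responses."""
    return abs(fmean([r["y"] for r in left]) - fmean([r["y"] for r in right]))


# Hypothetical toy population: the response jumps from 0 to 10 at x = 5.
population = [{"x": x, "y": 0.0 if x < 5 else 10.0} for x in range(10)]
splits = [lambda row, c=c: row["x"] <= c for c in range(9)]
terminal = recursive_partition(population, splits, mean_gap,
                               critical_value=5.0, min_size=2)
```

For simplicity the sketch recomputes every split of every node on each pass; only the two newly created nodes actually need to be re-examined.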

In drawing the tree, we start from the whole population which we represent 
as the root; each node which is split is joined by branches to its two 
“sons”. 

The search for the largest dissimilarity in tree construction is the most 
expensive part of the algorithm and the cost increases with the size of $S$. 
If $S$ is $S_B$, the Boolean Combination SDS class, the cost is prohibitive. An 
approximate search in $S_B$ was proposed by Breiman et al. (1984, pp. 136-138) 
and we have used it here, with obvious adaptations. 

From Breiman et al. (1984, pp. 140-150) we have also adapted the 
method for handling missing data, based on the notion of surrogate variables, 
and the definition of variable importance as an aid to interpretation. 

While the terminal nodes of an RP tree are homogeneous by definition 
(in the sense specified by $C_\alpha$), there is no guarantee that terminal 
nodes with a different “parent” node have truly different criterion 
quantities. For example, it is possible that nodes 4 and 5 of the tree in 
Figure 2 have indistinguishable survival curves. It is therefore useful to 
have an algorithm for amalgamating nodes which are not distinguishable from 
the point of view of the criterion quantity. 

Figure 2. “Similar” terminal nodes 4 and 5 are joined by the Amalgamation 
Algorithm. 

Let $P_1, P_2, \ldots, P_n$ be disjoint populations which are of “sufficient” 
size, since they may be thought of as arising from an RP algorithm. We need a 
joining criterion to decide whether any two populations are similar enough 
to be merged. Again, we can use the LRS to compare populations: 

Definition 

Let $P_1$, $P_2$ be two disjoint populations and let $\alpha$ be a preset SL. 
We shall say that $P_1$ and $P_2$ are joinable if $d(P_1, P_2)$ is not 
significant at the $\alpha$ level. 

Formally, an Amalgamation Algorithm (AA) is as follows (Hartigan, 
1975). 

Amalgamation Algorithm 

Choose $d$ and a joining criterion $J_\alpha$. Denote by 
$P = \{P_1, P_2, \ldots, P_n\}$ the current collection of populations. 

(i) To initialize, choose the original collection $\{P_1, P_2, \ldots, P_n\}$ 
as current. 

(ii) Calculate the matrix of $d(P_i, P_j)$, $i, j = 1, \ldots, r$ ($r$ being 
the size of the current collection), and check $J_\alpha$ for each pair. 

(iii) If no pair meets $J_\alpha$, conclude that the current collection is a 
collection of distinct populations and STOP. ELSE: 

(iv) Merge the pair $\{P_{i^*}, P_{j^*}\}$ with smallest dissimilarity and 
substitute for $P_{i^*}$, $P_{j^*}$ their union. Declare current the collection 
thus obtained. Go back to (ii). 

Notice that an AA is much simpler than an RP algorithm. After each 
merge, for example, only the $i^*$th row and the $j^*$th column of the new 
dissimilarity matrix must be computed. 
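The Amalgamation Algorithm can be sketched in the same hedged spirit. The sketch below assumes that the joining criterion is "dissimilarity below a fixed critical value" and uses a hypothetical toy dissimilarity (`mean_gap`, the absolute difference of means); neither choice comes from the paper.

```python
from itertools import combinations
from statistics import fmean


def amalgamate(populations, d, critical_value):
    """Merge the closest joinable pair until no pair is joinable."""
    current = [list(p) for p in populations]                 # step (i)
    while len(current) > 1:
        # step (ii): all pairwise dissimilarities of the current collection
        pairs = [(d(current[i], current[j]), i, j)
                 for i, j in combinations(range(len(current)), 2)]
        joinable = [pair for pair in pairs if pair[0] < critical_value]
        if not joinable:                                     # step (iii)
            break
        _, i, j = min(joinable, key=lambda pair: pair[0])    # step (iv)
        merged = current[i] + current[j]
        current = [p for k, p in enumerate(current) if k not in (i, j)]
        current.append(merged)
    return current


def mean_gap(a, b):
    """Hypothetical toy dissimilarity: absolute difference of means."""
    return abs(fmean(a) - fmean(b))


# Hypothetical toy populations: the first two are nearly identical.
populations = [[1.0, 1.0, 1.0], [1.1, 0.9, 1.0], [5.0, 5.0, 5.0]]
strata = amalgamate(populations, mean_gap, critical_value=0.5)
```

The sketch recomputes the whole matrix on every pass for clarity; as noted above, only the entries involving the merged population change, so an efficient version would update just those.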

If $\alpha$ is chosen so large that the joining criterion is always satisfied, 
an AA also leads to the construction of a rooted tree which is built 
“backwards”, starting from the terminal nodes. There is no reason to expect 
that by applying an AA with a very large $\alpha$ to the terminal nodes of an RP 
tree we should obtain the original RP tree. 

5. FINDING AN “HONEST” TREE: 

RESAMPLING PLANS AND THE AIC 

Although the following discussion is also applicable to Amalgamation 
Algorithms, and indeed to all exploratory methods such as stepwise 
regression, we will return to RP and limit our discussion, for the sake of 
simplicity, to RP trees. 

An RP tree is constructed according to a homogeneity criterion $C_\alpha$ 
(again for the sake of simplicity we will assume that the minimum size of 
admissible subpopulations is fixed once and for all) and the SL $\alpha$ is 
used as a “stopping rule”, to decide when a node is terminal. This choice, 
however, is vulnerable to the criticism that we have ignored the problem of 
multiple comparisons. Since at each node many splits are examined and many 
SL’s are computed, the SL $\alpha$ is at best nominal. Indeed, no matter what 
stopping rule is used, an RP tree would still have the tendency to overfit 
the data, that is, to capitalize on the peculiarities of the particular data 
from which it has been constructed. In order to avoid making unwarranted 
generalizations, it is advisable to have a separate collection of data on 
which to evaluate a tree constructed in an exploratory mode; however, such 
separate collections are rarely available. An alternative is to split the 
original sample into two halves, one for tree construction, the other for 
tree evaluation. By doing this, however, one sacrifices sample size, hence 
statistical power, which may be a serious drawback unless the data set is 
extremely large. A third possibility is that of generating a number of 
pseudo-samples from the original data set by one of several “resampling 
schemes” (Efron, 1982); these samples are then used for evaluating trees 
which have been generated on the original data set. 

Let $D$ denote a data set of size $N$. By a “resampling plan” we mean a 
procedure to generate a number $M$ of pseudo-samples from $D$: 
$D^{(1)}, \ldots, D^{(M)}$. The following are some of the most popular 
resampling schemes. In jackknife and cross-validation $D^{(i)}$, 
$i = 1, \ldots, N$, is obtained by dropping the $i$th individual from $D$; we 
then obtain $N$ samples of size $N - 1$. In $V$-fold cross-validation one 
constructs $M = N/V$ samples of size $N - V$ from $D$ by dropping, in turn, 
each of $N/V$ groups of size $V$. The bootstrap builds $M = B$ samples ($B$ 
arbitrary and large) of size $N$ by sampling $N$ elements with replacement 
from $D$, $B$ times. 
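The three resampling schemes can be sketched as follows. This is a minimal illustration, assuming (as the later reference to “(1-fold) cross-validation” suggests) that $V$ denotes the size of each dropped group and that $N$ is a multiple of $V$; the function names are hypothetical and the groups are taken as consecutive blocks, whereas in practice the data would normally be shuffled first.

```python
import random


def jackknife(data):
    """N pseudo-samples of size N - 1, each dropping one individual."""
    return [data[:i] + data[i + 1:] for i in range(len(data))]


def v_fold(data, v):
    """N / v pseudo-samples of size N - v, each dropping one block of size v."""
    return [data[:start] + data[start + v:] for start in range(0, len(data), v)]


def bootstrap(data, b, seed=0):
    """b pseudo-samples of size N, drawn with replacement from the data."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(b)]


data = list(range(6))
folds = v_fold(data, 2)
```

With groups of size one, `v_fold(data, 1)` reproduces the jackknife samples, which matches the description of the jackknife and leave-one-out cross-validation sharing the same pseudo-samples.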

We propose to generate a nested family of trees from $D$ and to use $D^{(i)}$, 
$i = 1, \ldots, M$, in order to select one that is less vulnerable to the 
criticism of overfitting: an “honest” tree. For each tree $T_k$ and each 
pseudo-sample $D^{(i)}$, we use the global dissimilarity $\rho$ (defined by 
equation (4), Section 4), of the terminal nodes to measure the adequacy 
$A_k^{(i)}$ of $T_k$ for $D^{(i)}$. For the bootstrap and the jackknife we 
simply estimate the parameters of the likelihoods and calculate $\rho$ on 
$D^{(i)}$. In cross-validation, on the other hand, we estimate the unknown 
parameter on $D^{(i)}$ and calculate $\rho$ for the data points in 
$D - D^{(i)}$. The algorithm described below is a generalization of one 
originally proposed by Stone (1974), who used cross-validation, and of the 
approach of Breiman et al. (1984) to tree construction. 

Tree Selection Algorithm (TSA) 

(i) Choose a sequence of SL’s $\alpha_1 < \alpha_2 < \cdots < \alpha_n$. 

(ii) For each $\alpha_k$ build an RP tree, $T_k$, with $C_{\alpha_k}$ as 
homogeneity criterion. Clearly, $T_1 < T_2 < \cdots < T_n$. (In particular, if 
$\alpha_1 = 0$, $T_1$ is the trivial tree with only one node representing the 
whole population, and if $\alpha_n = 1$, $T_n$ is a tree with terminal nodes 
such that any further split results in subpopulations which are too small to 
satisfy the homogeneity criterion.) 

(iii) By an appropriate resampling scheme build $M$ samples $D^{(i)}$, 
$i = 1, \ldots, M$, from $D$. 

(iv) Let $A_k^{(i)}$ denote the adequacy of the tree $T_k$ for $D^{(i)}$. 
Compute 

$$Ave(A_k) = \frac{1}{M} \sum_{i=1}^{M} A_k^{(i)},$$

$$SD(A_k) = \left\{ \frac{1}{M} \sum_{i=1}^{M} [A_k^{(i)} - Ave(A_k)]^2 \right\}^{1/2}.$$

(v) Choose the tree corresponding to maximum $Ave(A_k)$. Or choose the 
simplest tree among those for which $Ave(A_k)$ falls within $\ell$ SD’s from 
the “best” tree ($\ell$ fixed and usually $\ell = 1, 2, 3$). 
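Steps (iv) and (v) reduce to a small computation once the adequacies are available. The sketch below is illustrative only: the adequacy values and tree sizes are hypothetical, the SD uses the divisor $M$, and “within $\ell$ SD’s from the best tree” is interpreted as within $\ell$ times the SD of the best tree, which is one possible reading of the rule.

```python
from statistics import fmean, pstdev


def select_tree(adequacy, sizes, n_sd=1.0):
    """Return (index of best tree, index of simplest tree within n_sd SDs of it).

    adequacy[k] lists the adequacies of tree k over the M pseudo-samples;
    sizes[k] is its number of terminal nodes (smaller means simpler).
    """
    ave = [fmean(values) for values in adequacy]
    sd = [pstdev(values) for values in adequacy]   # divisor M
    best = max(range(len(ave)), key=lambda k: ave[k])
    threshold = ave[best] - n_sd * sd[best]
    eligible = [k for k in range(len(ave)) if ave[k] >= threshold]
    simplest = min(eligible, key=lambda k: sizes[k])
    return best, simplest


# Hypothetical adequacies of three nested trees over M = 4 pseudo-samples.
adequacy = [[1.0, 1.0, 1.0, 1.0], [9.0, 11.0, 9.0, 11.0], [10.0, 12.0, 10.0, 12.0]]
sizes = [1, 3, 5]
best, simplest = select_tree(adequacy, sizes, n_sd=1.0)
```

Here the largest tree has the highest average adequacy, but the three-node tree lies within one SD of it and would be preferred under the second rule of step (v).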

The above TSA is computationally expensive because of the resampling 
necessary to calculate $Ave(A_k)$ and $SD(A_k)$. A great simplification is 
introduced by using the Akaike Information Criterion (AIC) instead of 
$Ave(A_k)$ and by selecting the tree with the smallest AIC. The AIC for $T_k$ 
is: 

$$AIC_k = -\rho_k + 2 p_k,$$

where $p_k$ is, in this case, the number of terminal nodes of $T_k$. This is 
justified by a result of Stone (1977) who showed that the AIC is 
approximately equivalent to $-Ave(A_k)$ for (1-fold) cross-validation. 



Figure 3. Tree selection with the aid of Akaike’s Information Criterion. 

It is useful to draw a profile of $\alpha_k$ versus $AIC_k$, which will have, in 
general, the shape of Figure 3. The shape of the AIC curve may suggest 
choosing a simpler tree than the one corresponding to the minimum AIC. 
If the curve has an elbow (Figure 3), one can argue that the elbow tree is 
a good choice. This is analogous to the situation in factor analysis, when 
one chooses the number of factors corresponding to the elbow of the SCREE 
plot. 
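As a small illustration of reading such a profile, the sketch below uses the AIC values tabulated with Figure 10 of this paper (myeloma data) and reports the minimum together with each tree's excess over it. The function name is hypothetical, and no automatic elbow rule is implied: the choice of an elbow remains a judgement made from the plot.

```python
# (alpha range, AIC) pairs as tabulated with Figure 10.
profile = [
    ("0.01-0.10", 4330.15),
    ("0.15", 4327.65),
    ("0.20-0.65", 4326.25),
    ("0.70-0.90", 4327.91),
]


def aic_profile(rows):
    """Return the label of the minimum-AIC entry and every entry's excess over it."""
    best_label, best_aic = min(rows, key=lambda row: row[1])
    excess = [(label, round(aic - best_aic, 2)) for label, aic in rows]
    return best_label, excess


best_label, excess = aic_profile(profile)
for label, gap in excess:
    print(f"alpha {label}: AIC exceeds the minimum by {gap}")
```

The tree at $\alpha = 0.15$, described later as the elbow tree, sits 1.4 AIC units above the minimum.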


6. EXAMPLES 


a) Prognostic Stratification 

There are several examples in the literature of prognostic strata defined 
by various prognostic variables for multiple myeloma (Bataille et al., 1985). 
However, the variables used are not always available at every institute. 
Certain blood tests or other diagnostic tests are not performed uniformly by 
all cancer centers which treat myeloma patients. Hence, it is desirable to 
define prognostic stratification rules using the variables that are measured 
at a given institute (in this case the Princess Margaret Hospital). The 
variables conveniently available at PMH are shown in Table 1, and were used 
by the Recursive Partition and Amalgamation Algorithms to define the tree 
shown in Figure 4. For the Recursive Partition Algorithm, the class $S_B$ of 
SDS was used to derive admissible splits, along with an exponential model 
likelihood dissimilarity measure. The criterion of homogeneity $C_\alpha$ was: 
declare node $N$ homogeneous if for any split $\{P_{1j}, P_{2j}\}$, either one 
of $P_{1j}$ and $P_{2j}$ has fewer than 25 individuals, or $d(P_{1j}, P_{2j})$ 
is not significant at the $\alpha$ level. The results presented here are for 
$\alpha = 0.05$. 
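To make the exponential-model dissimilarity concrete, here is a hedged sketch of a likelihood ratio statistic for comparing two groups of censored survival times. It assumes the standard censored exponential likelihood (hazard estimate = deaths / total follow-up time) and a chi-square reference distribution with 1 degree of freedom; the data layout `(time, event)` and the function names are assumptions of this sketch, not taken from the paper.

```python
import math


def exp_loglik(deaths, exposure):
    """Maximized exponential log-likelihood: d * log(d / T) - d."""
    if deaths == 0:
        return 0.0
    return deaths * math.log(deaths / exposure) - deaths


def exp_lrs(pop1, pop2):
    """LRS for equal hazards; each population is a list of (time, event) pairs."""
    d1, t1 = sum(e for _, e in pop1), sum(t for t, _ in pop1)
    d2, t2 = sum(e for _, e in pop2), sum(t for t, _ in pop2)
    stat = 2.0 * (exp_loglik(d1, t1) + exp_loglik(d2, t2)
                  - exp_loglik(d1 + d2, t1 + t2))
    return max(stat, 0.0)  # guard against tiny negative rounding error


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def split_is_significant(pop1, pop2, alpha=0.05, min_size=25):
    """False if the split is inadmissible (a side is too small) or not significant."""
    if len(pop1) < min_size or len(pop2) < min_size:
        return False
    return chi2_sf_1df(exp_lrs(pop1, pop2)) < alpha


# Hypothetical groups: 30 deaths in 150 years versus 30 deaths in 30 years.
slow = [(5.0, 1)] * 30
fast = [(1.0, 1)] * 30
statistic = exp_lrs(slow, fast)
```

A node would then be declared homogeneous when `split_is_significant` is false for every split considered.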


Table 1. Multiple Myeloma Variables Used by the Recursive 
Partition Algorithm for Prognostic Stratification 

 1. Haemoglobin level           normal, mildly depressed, severely depressed 
 2. Calcium level               normal, mildly elevated, severely elevated 
 3. Creatinine level            normal, mildly elevated, severely elevated 
 4. Alkaline Phosphatase level  normal, mildly elevated, severely elevated 
 5. White Blood Cell % Plasma   0%, > 0% 
 6. Bone Marrow % Plasma        normal, trace, elevated 
 7. Lytic Lesion count          0-3 lesions, 4 or more lesions 
 8. Age                         0-45, 45-60, over 60 years 
 9. Symptom Duration            0-29, 30 or more months 
10. M-protein risk factor       Low risk, High risk 


For the Amalgamation Algorithm, the same dissimilarity measure 
(exponential model) was used, in conjunction with the joining rule 
$J_\alpha$ with $\alpha = 0.05$. 

The algorithms yield the three terminal clusters labelled as stages 1 to 
3, shown in Figure 4. The Kaplan-Meier estimates of the survival curves for 
each stage appear in Figure 5. For the practicing clinician, such information 
is immediately useful. The strata-defining rules are phrased in clinically 
meaningful terms. 

A total of at most three decisions are required to classify any patient, 
and the prognosis in terms of expected years of survival is quickly available 
from the estimated survival curves. Contrast this with the usual current 
statistical methodology for evaluating prognostic factors: analysis via the 
Cox model. The same prognostic factors as shown in Table 1 were recoded 
as 0-1 indicator variables as shown in Table 2, and used in a stepwise step-up 
Cox model algorithm (BMDP2L). The $\alpha$-level to enter was set at 
$\alpha = 0.05$. The final variables selected appear in Table 3. Patients could 
be divided into prognostic strata by examining the distribution of the 
regression score $s = \hat{\beta} \cdot z$ of the 302 patients, and splitting 
this distribution into three or four disjoint subgroups. Such strata are, 
however, difficult to justify and explain to clinicians, as the 
strata-defining rules will not in general be clinically meaningful. 

Table 2. Multiple Myeloma Variables Used by the Stepwise 
Cox Proportional Hazards Regression Algorithm (BMDP2L) 

$Z_1$    = I (Haemoglobin = mildly depressed) 
$Z_2$    = I (Haemoglobin = normal) 
$Z_3$    = I (Calcium = mildly elevated) 
$Z_4$    = I (Calcium = severely elevated) 
$Z_5$    = I (Creatinine = mildly elevated) 
$Z_6$    = I (Creatinine = severely elevated) 
$Z_7$    = I (Alk. Phos. = mildly elevated) 
$Z_8$    = I (Alk. Phos. = severely elevated) 
$Z_9$    = I (White Blood Cell % Plasma > 0%) 
$Z_{10}$ = I (Bone Marrow % Plasma = trace) 
$Z_{11}$ = I (Bone Marrow % Plasma = elevated) 
$Z_{12}$ = I (4 or more lytic lesions) 
$Z_{13}$ = I (Age 45-60 years) 
$Z_{14}$ = I (Age over 60 years) 
$Z_{15}$ = I (Symptoms lasted 30 or more months) 
$Z_{16}$ = I (M-protein risk factor = high risk) 

I(X) = 1 if the logical statement X is true. 
I(X) = 0 if the logical statement X is false. 
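The indicator coding of Table 2 and the regression score $s = \hat{\beta} \cdot z$ can be illustrated with the seven coefficients reported in Table 3. The sketch below is only an illustration: the patient record layout and its field names are hypothetical, and only the indicators retained by the stepwise procedure are coded.

```python
# Parameter estimates as reported in Table 3 (stepwise Cox model).
BETA = {
    "Z2": -0.6769,   # Haemoglobin = normal
    "Z4": 0.5805,    # Calcium = severely elevated
    "Z8": 1.0879,    # Alk. Phos. = severely elevated
    "Z10": 0.3986,   # Bone Marrow % Plasma = trace
    "Z11": 0.4111,   # Bone Marrow % Plasma = elevated
    "Z14": 0.3327,   # Age over 60 years
    "Z15": -1.2100,  # Symptoms lasted 30 or more months
}


def indicators(patient):
    """0-1 indicator variables for one patient (field names are hypothetical)."""
    return {
        "Z2": int(patient["haemoglobin"] == "normal"),
        "Z4": int(patient["calcium"] == "severely elevated"),
        "Z8": int(patient["alk_phos"] == "severely elevated"),
        "Z10": int(patient["bone_marrow_plasma"] == "trace"),
        "Z11": int(patient["bone_marrow_plasma"] == "elevated"),
        "Z14": int(patient["age"] == "over 60"),
        "Z15": int(patient["symptom_duration"] == "30 or more months"),
    }


def regression_score(patient):
    """Linear predictor: sum of beta_j * z_j over the selected indicators."""
    z = indicators(patient)
    return sum(BETA[name] * z[name] for name in BETA)


# Two hypothetical patients.
patient_a = {"haemoglobin": "normal", "calcium": "normal", "alk_phos": "normal",
             "bone_marrow_plasma": "normal", "age": "45-60",
             "symptom_duration": "0-29 months"}
patient_b = {"haemoglobin": "severely depressed", "calcium": "severely elevated",
             "alk_phos": "normal", "bone_marrow_plasma": "elevated",
             "age": "over 60", "symptom_duration": "0-29 months"}
scores = [round(regression_score(p), 4) for p in (patient_a, patient_b)]
```

A score like this orders patients by estimated risk, but a cut-point on it (say, the upper third of scores) does not translate into a simple statement about haemoglobin or age, which is the interpretability concern raised above.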

We can compare the recursive partition model to the stepwise Cox model, 
by defining two indicator variables for the three stages of Figure 4. The 
indicator variables are shown in Table 4. Fitting a Cox model yields 
the results shown in Table 5. The RP model requires two parameters instead 
of seven, and is to be preferred in terms of Akaike’s Information Criterion. 

Table 3. Step-up Stepwise Fitted Cox Model 
($\alpha$ = 0.05 Level of Significance) 

Variable   Description         Parameter estimate            Chi-square   p-value 
$Z_2$      HG = normal         $\hat{\beta}_2$ = -0.6769      23.70        $< 10^{-4}$ 
$Z_4$      CA = sev. elev.     $\hat{\beta}_4$ = 0.5805       7.25         0.0071 
$Z_{14}$   Age > 60            $\hat{\beta}_{14}$ = 0.3327    6.76         0.0093 
$Z_{15}$   Symp. > 30 months   $\hat{\beta}_{15}$ = -1.2100   9.99         0.0016 
$Z_8$      AP sev. elev.       $\hat{\beta}_8$ = 1.0879       6.87         0.0088 
$Z_{11}$   BM % P elevated     $\hat{\beta}_{11}$ = 0.4111    6.42         0.011 
$Z_{10}$   BM % P trace        $\hat{\beta}_{10}$ = 0.3986    4.33         0.038 

No. of patients = 302        No. of parameters = 7 
log likelihood = -1267.70    AIC = 2549.416 

b) Propensity Stratification 

When performing retrospective analyses on a group of variously treated 
patients, any identified difference in behaviour of one subgroup versus 
another may be due to the differing treatments used. If proportionately more 
males receive treatment A than B, while more females receive B than A, 
an analysis including sex may show sex to be an important factor when in 
fact, sex is unimportant but treatment A is superior to B. Conversely, when 
examining treatment A and B, a perceived benefit for A may really be 
reflecting the biological fact that males fare better than females for the 
given disease. Hence, identifying treatment imbalances among patient 
subgroups is a valuable aid in understanding effects observed in retrospect. 

In a retrospective analysis of small cell lung cancer patients at Princess 
Margaret Hospital, various prognostic variables were identified. However, 
prophylactic cranial irradiation (PCI) treatment was used on some of the 
patients. Before making conclusions about the prognostic factors identified, 
the distribution of PCI among the patients needs to be clearly understood. 
Table 6 shows four of the most important prognostic factors, within which 
the distribution of PCI needs to be investigated. For most patients, a record 
of whether or not PCI was administered can be found in their medical charts, 
but for ten percent of the patients no reference to PCI can be found. Hence 
there are three treatment categories to investigate: (1) PCI administered, 
(2) PCI not administered, and (3) PCI assignment status unknown. The 
response variable PCI thus has 3 levels, suggesting the use of a multinomial 
model. 

Figure 4. Recursive Partition with Boolean Combinations. Prognostic 
Stratification Tree for 302 Myeloma Patients - P.M.H. 1960-80. 

Figure 5. [Survival curves; horizontal axis: Time (years).] 

Table 4. Variables Defined by the Terminal Nodes 
of the Recursive Partition Tree of Figure 4 

$Z_1$ = I (patient = stage 2) 
$Z_2$ = I (patient = stage 3) 

Table 5. Cox Model Defined by Recursive Partition 
and Amalgamation Algorithms 
($\alpha$ = 0.05 Level of Significance) 

Variable   Description   Parameter estimate          Chi-square   p-value       Risk (relative to Low risk) 
$Z_1$      Stage 2       $\hat{\beta}_1$ = 0.6187     20.09        $< 10^{-4}$   1.8565 
$Z_2$      Stage 3       $\hat{\beta}_2$ = 1.6201     63.11        $< 10^{-4}$   5.0537 

No. of patients = 302          No. of parameters = 2 
log likelihood = -1270.1835    AIC = 2544.367 


For the recursive partition algorithm, the class $S_B$ of SDS was used, 
with a multinomial likelihood dissimilarity measure. The criterion of 
homogeneity $C_\alpha$, $\alpha = 0.05$, was used as previously described; 
however, in this case $d(P_{1j}, P_{2j})$ is distributed (asymptotically) as 
$\chi^2_{(2)}$ since there are three classes for the response variable. 
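The multinomial dissimilarity and its chi-square reference distribution can be sketched as follows. The sketch assumes the usual likelihood ratio statistic for equality of two multinomial probability vectors, with degrees of freedom equal to the number of classes minus one (two here); the counts are hypothetical and are not the PMH data.

```python
import math


def multinomial_loglik(counts):
    """Maximized multinomial log-likelihood: sum of n_j * log(n_j / n)."""
    total = sum(counts)
    return sum(c * math.log(c / total) for c in counts if c > 0)


def multinomial_lrs(counts1, counts2):
    """LRS for the hypothesis that two populations share the same probabilities."""
    pooled = [a + b for a, b in zip(counts1, counts2)]
    return 2.0 * (multinomial_loglik(counts1) + multinomial_loglik(counts2)
                  - multinomial_loglik(pooled))


def chi2_sf_2df(x):
    """Upper tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


# Hypothetical counts of (PCI yes, PCI no, not stated) in two candidate subgroups.
group_1 = [60, 40, 0]
group_2 = [30, 40, 30]
statistic = multinomial_lrs(group_1, group_2)
p_value = chi2_sf_2df(statistic)
```

With three response classes the tail probability has the closed form $e^{-x/2}$, so no statistical library is needed for this particular case.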

The recursive partition algorithm yields the tree shown in Figure 6, 
and reveals three subgroups with widely differing PCI assignment rates. 
All of the “PCI assignment not stated” (N/S) patients are limited disease 
patients (disease localized in the lung), a good prognostic subgroup. Since 
the distribution of N/S patients is so systematic, some understanding of their 
probable PCI assignment may be possible through further discussions with 
the clinicians involved. Also, within this good prognostic subgroup, the PCI 
assignment rate is the lowest, as one would expect from a medical treatment 
decision point of view. Males with extensive disease (disease not localized 
to the lung alone), one of the prognostically worst subgroups, show a PCI 
assignment rate of 60%, whereas females with extensive disease show a rate 
of 53%. 

Table 6. Small Cell Lung Cancer Variables 
Used by the Recursive Partition Algorithm 
For Propensity Stratification 

Stratifying variables: 

  Age                   0-49, 50-59, 60-69, 70-99 
  Sex                   Male, Female 
  Performance Status    1, 2, 3, 4 
  Extent of Disease     Limited, Extensive 

Response variable: 

  Treatment includes Prophylactic 
  Cranial Irradiation (PCI)?        Yes, No, Not stated 

Hence, there is some confounding of PCI with sex for patients with 
extensive disease, and an overall confounding of PCI with extent of disease. 
These facts must be borne in mind when investigating either the benefits of 
PCI, or the prognostic value of sex and extent of disease. 

Figure 6. Recursive Partition. Propensity Stratification for 640 Small Cell 
Lung Cancer Patients, P.M.H. 1975-1983. [Tree diagram; recoverable node 
labels: P(PCI) = 0.532, P(N/S) = 0.00; P(PCI) = 0.599, P(N/S) = 0.00.] 

c) Treatment-Response Stratification 

Often in clinical trials, the magnitude of the difference in effect of two 
treatments will be larger within some subgroups of the patient population 
than the magnitude within other subgroups. Identification of such subgroups 
is important in deciding where the treatments should be used, especially if 
one treatment is more toxic or costly and doing relatively little better than 
another for a portion of the patients. 

From 1971 to 1979 a series of clinical trials for ovarian cancer patients 
was conducted at Princess Margaret Hospital comparing pelvic irradiation 
and abdominopelvic irradiation. The patients were all FIGO stage IB, II 
and asymptomatic stage III, with a completed bilateral salpingo-oophorectomy 
and hysterectomy (BSOH). The patients received either radiation to the pelvic 
area (pelvic irradiation), or radiation to the pelvic area plus the abdominal 
lymph node area (abdominopelvic irradiation). Overall, abdominopelvic 
irradiation was shown to be superior to pelvic irradiation, especially for 
delaying recurrence (Dembo and Bush, 1982). But is the observed effect true 
for the entire patient population investigated? 

Five of the most important prognostic variables are shown in Table 7a; 
these are the predictor variables $z$. The treatment variable $x$, a 0-1 
indicator variable for treatment, is shown in Table 7b. This variable and the 
censored survival time constitute the criterion variables $u$. 

For the Recursive Partition Algorithm, the class $S_B$ of SDS was used, 
along with a Cox likelihood dissimilarity measure to test for differences of 
regression equations (see equation (6), Section 3). The indicator variable 
$x$ ($x$ = 0 or 1 according as the patient received pelvic or abdominopelvic 
irradiation) was used as the regression treatment covariate. The criterion 
of homogeneity $C_\alpha$ was as in the previous example. 

Table 7. Ovarian Cancer Variables Used by the Recursive 
Partition Algorithm for Treatment-Response Stratification 
a) Criterion variables, b) Treatment variable 

a) 
  1  Age         < 40, > 40. 
  2  Histology   undifferentiated, serous, endometroid, mucinous, clear cell. 
  3  Grade       well differentiated, moderately diff., poorly diff. 
  4  Stage       IB, II, III. 
  5  Residuum    none, small amount. 

b) 
  X = I (treatment = Abdominopelvic irradiation). 

The algorithm yields the tree shown in Figure 7. The relative risk 
estimates (R.R.) from the Cox model for each stratum are also shown 
(R.R. = $\exp(\hat{\beta}_j)$, $j = 1, 2$). The splitting rule identifies a 
favourable subgroup of patients (no residual disease after BSOH, age > 40, 
and Stage = IB or II). Within this favourable subgroup, the addition of 
abdominal lymph node irradiation reduces the rate of disease recurrence by a 
factor of $1/0.39 \approx 2.6$. The accompanying Kaplan-Meier estimates of the 
relapse-free curves for the two treatment groups are shown in Figure 8. The 
parameter estimate $\hat{\beta}_1$ is significantly different from zero 
(p = 0.0027). Within the complementary subgroup, the parameter estimate 
$\hat{\beta}_2$ is not significantly different from zero (p = 0.068), 
indicating little difference between the two treatment methods (Figure 9). 
Hence, there appears to be a subgroup of the selected patient population that 
benefits from the extra abdominal irradiation, while the complementary 
patient population shows no benefit. 
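The link between the relative risks and the Cox coefficients is simply R.R. = exp(coefficient), so the quantities quoted above can be recovered from the values printed in Figure 7. The short sketch below does this arithmetic; the dictionary keys are labels chosen for this illustration.

```python
import math

# Relative risks for abdominopelvic versus pelvic irradiation, from Figure 7.
relative_risk = {"favourable subgroup": 0.39, "complementary subgroup": 1.54}

# Implied Cox coefficients: beta = log(R.R.).
implied_beta = {name: round(math.log(rr), 3) for name, rr in relative_risk.items()}

# Factor by which the recurrence rate is reduced in the favourable subgroup.
reduction_factor = round(1.0 / relative_risk["favourable subgroup"], 2)

print(implied_beta)
print(reduction_factor)
```

A negative implied coefficient corresponds to a protective effect of the additional abdominal irradiation, a positive one to a higher estimated recurrence rate.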

Identification of such subgroups is an important aspect within the 
framework of a long-term series of clinical trials, so that treatments can 
eventually be matched to that subgroup of patients where the effects are 
greatest. 

Figure 7. Recursive Partition with Boolean Combinations. Subgroup Regression 
Tree, 276 Ovarian Cancer Patients - P.M.H. 1971-79. [Tree diagram with two 
terminal nodes; relative risk estimates, with numbers of patients in 
parentheses:] 

                        RR (No.)      RR (No.) 
  Pelvic XRT            1.00 (42)     1.00 (40) 
  Abdominopelvic XRT    0.39 (78)     1.54 (116) 

d) Minimum AIC plots 

Returning to the myeloma data example, we can vary the level $\alpha$ of 
the $C_\alpha$ criterion and calculate the AIC for the resulting trees. The 
AIC’s found by varying $\alpha$ from 0.01 to 0.90 are shown in Figure 10. The 
AIC plot suggests that the trees corresponding to $\alpha = 0.15$ (the “elbow” 
tree) or $\alpha = 0.20$ are in some sense to be preferred. The elbow tree is 
shown in Figure 11. The prognostic stratification rule now yields four 
patient subgroups. The Kaplan-Meier survival curve estimates for each 
subgroup are shown in Figure 12. Table 8 shows the three indicator variables 
defined by the elbow tree, and Table 9 lists the results of the Cox model 
derived from these indicator variables. Notice that the AIC for this elbow 
tree is the smallest AIC of the three Cox models fitted to the myeloma data 
(Tables 3, 5 and 9). The log likelihood for this elbow tree (-1265.78) is 
also the largest of the three model log likelihoods. The AIC plot as a guide 
to model selection is much more informative than merely setting 
$\alpha = 0.05$, as is traditional in classical statistics. 
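The AIC values in Tables 3, 5 and 9 are consistent with the usual form AIC = -2 (log likelihood) + 2 (number of parameters), which can be checked directly from the printed log likelihoods. The sketch below does so; small discrepancies in the last digits are to be expected because the log likelihoods are printed rounded.

```python
def aic(log_likelihood, n_parameters):
    """Akaike Information Criterion: -2 log L + 2 p."""
    return -2.0 * log_likelihood + 2.0 * n_parameters


# (log likelihood, number of parameters, AIC as printed) for the three Cox models.
models = {
    "Table 3 (stepwise Cox)": (-1267.70, 7, 2549.416),
    "Table 5 (RP, 3 stages)": (-1270.1835, 2, 2544.367),
    "Table 9 (elbow tree, 4 stages)": (-1265.7825, 3, 2537.56),
}

computed = {name: aic(loglik, p) for name, (loglik, p, _) in models.items()}
preferred = min(computed, key=computed.get)

for name, value in computed.items():
    print(f"{name}: computed AIC = {value:.3f}, printed AIC = {models[name][2]}")
```

The elbow-tree model attains the smallest AIC even though it uses one more parameter than the three-stage model, because its log likelihood is higher by more than one unit.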


Figure 8. 

Figure 9. 

Table 8. Variables Defined by the Terminal Nodes 
of the Recursive Partition Tree of Figure 11 

$Z_1$ = I (patient = stage 2) 
$Z_2$ = I (patient = stage 3) 
$Z_3$ = I (patient = stage 4) 

Table 9. 

Variable   Description   Parameter estimate          Chi-square   p-value       Risk (relative to Stage 1) 
$Z_1$      Stage 2       $\hat{\beta}_1$ = 0.6067     17.06        $< 10^{-4}$   1.83 
$Z_2$      Stage 3       $\hat{\beta}_2$ = 1.1527     20.18        $< 10^{-4}$   3.16 
$Z_3$      Stage 4       $\hat{\beta}_3$ = 1.7895     69.44        $< 10^{-4}$   5.98 

No. of patients = 302          No. of parameters = 3 
log likelihood = -1265.7825    AIC = 2537.56 


7. DISCUSSION 

We have outlined a general framework for defining strata from 
biostatistical data. 

The main novelty of this formulation is its generality and flexibility. One 
needs to specify the language in which strata are to be defined, the 
objective of the stratification, and the homogeneity criterion to decide 
whether a population is homogeneous in the sense specified by the objective 
of the stratification. In our formulation, the language is formed from the 
SDS class, the objective is embodied in the definition of dissimilarity, and 
the homogeneity criterion is formulated on the basis of the statistical 
properties of the dissimilarity measure (large sample approximation). Once 
the choices are made, RP, and possibly amalgamation, provide algorithms for 
obtaining strata. 

Figure 10. Choice of tree with the aid of Akaike’s Information Criterion. 
[Plot of AIC against $\alpha$, with the “elbow tree” and the “minimum AIC 
tree” marked; the plotted values are:] 

  $\alpha$        AIC 
  0.01-0.10     4330.15 
  0.15          4327.65 
  0.20-0.65     4326.25 
  0.70-0.90     4327.91 

The general framework permits one to generalize the “node impurity 
measures” used by Breiman et al. (1984) to the situation in which the 
objective of the stratification is defined on the basis of censored survival 
data (e.g., prognostic stratification), of categorical response (e.g., 
propensity stratification), or of a relationship between variables (e.g., 
treatment response stratification). In particular, we have discussed a class 
of dissimilarity measures based on likelihood ratio statistics which covers 
many of the situations encountered in biostatistics. Within this framework, 
most of the ideas developed by Breiman et al. (1984) can be implemented with 
minor modifications. Indeed, our work strongly relies on appropriate 
adaptations of some of these ideas (missing data, variable importance, 
Boolean combinations, etc.). 

We have also attempted to formulate the problem of model selection 
and to suggest an approach slightly more general than that of Breiman et 
al. (1984) via the idea of resampling schemes. This formulation and the use 
of likelihood based measures of model significance allow us to use the AIC 
instead of the more expensive cross-validation or bootstrap calculations for 
the selection of “honest” trees. 

The examples discussed in this paper illustrate the flexibility of the 
framework and its usefulness in the clinical context. They show that RP 
produces results which are clearly interpretable and therefore attractive to 
physicians. From a statistical point of view, and if we adopt the AIC as a 
criterion for model comparison, the examples show that RP based models 
compare well or favourably with those obtained by stepwise regression. 

The emphasis of this paper has been on the general formulation. Clearly 
many points need clarification and further investigation. For example, the 
choice of dissimilarity measures other than the likelihood ratio could be more 
desirable in certain applications. We have implemented, in the case of 
prognostic stratification, a variety of alternatives to the exponential model 
and the Cox model likelihood ratio statistics. The results, however, do not 
seem dramatically different from those obtained with the choices discussed in 
this work; moreover, there seems to be no substitute for the AIC criterion in 
“honest” tree selection. The AIC criterion is equivalent to a cross-validation 
criterion only under large sample approximation (Stone, 1977). However, 
the validity of the approximation has not been discussed here and will be 
the object of a separate study. Also, the comparative advantages and 
disadvantages of cross-validation, bootstrap and jackknife in model selection 
have not been studied in this work but merit serious attention. 

Figure 11. Recursive Partition with Boolean Combinations and Amalgamation 
Algorithms. Prognostic Stratification for 302 Myeloma Patients - P.M.H. 
1960-80. Elbow tree ($\alpha$ = 0.15). 

Comparisons between regression models and RP trees also need a more 
systematic approach than the one outlined here. In fact it would be highly 
desirable to integrate RP and regression. It is by no means clear that one 
should perform uniformly better than the other. On the contrary, the two 
approaches appear to be complementary. 

There is a fundamental criticism that can be raised against RP methods, 
or rather, against their careless use: RP, just like stepwise regression and 
factor analysis, remains principally an exploratory method and should not 
be used to make extraordinary claims. There is no substitute for 
independently collected data if one wishes to evaluate the merit of a 
data-generated hypothesis. Similarly, there is no substitute for controlled 
clinical studies if one needs a reasonable degree of certainty about a 
medical theory. 

Figure 12. [Survival curves; horizontal axis: Time (days).] 

On the other hand, methods like RP, especially if strengthened by 
appropriate model selection procedures such as those outlined in Section 5, 
are a useful tool in obtaining as much information as possible from existing 
data. They have the advantage over more traditional stepwise regression 
techniques of producing results in clinical terms, thus facilitating the 
complex task of hypothesis formulation. 

The problem remains that exploratory methods are difficult to evaluate 
from a statistical point of view. There seems to be no alternative to the 
use of intense, complex and costly simulations in order to explore the limits 
of validity of a method like RP. The task is thankless, but necessary and 
important. 


STATISTICAL RAMIFICATIONS OF LINEAR 
ANALOGUE SCALES IN ASSESSING 
THE QUALITY OF LIFE OF CANCER PATIENTS 

ABSTRACT 

In evaluating cancer treatments, it is important to know both about 
the quality and duration of the patient’s survival. This study was initiated 
to investigate the properties of scales that could be used to assess quality 
of life. 115 breast cancer patients from the Princess Margaret Hospital in 
Toronto were asked to evaluate their quality of life using linear analogue 
scales. Thirty-one variables covered the areas of general health and major 
problems associated with breast cancer derived from clinical experience and 
the opinions of patients with the disease. Patients found the linear analogues 
simple to use, and the broad areas considered by each scale made the mental 
effort much less than what would be involved with extensive evaluative 
inventories. 

Our general health items were compared with the Sickness Impact Profile 
from which they were derived to assess the validity of the method. Our 
patient population produced reliable test-retest results that were stable for 
both factor analysis and multivariate regressions. Also, we can report the 
results of some sample size investigations for both reliability and stability of 
data collected in this and other studies.

1. INTRODUCTION

In evaluating cancer treatments, it is important to know about both the
quality and the duration of the patient’s survival. A lot of work has been
directed at survival analyses, but relatively little has been done to assess the
patient’s quality of life. Our data were to be collected in a clinical setting, so
the approach needed to be simple and quick; large health inventories would
be unwieldy for general use.

Priestman and Baum (1976) looked at the quality of life in advanced
breast cancer patients with Linear Analogue Self-Assessment (LASA) scales.
They used a 10 centimeter line with the ends of the line labelled with
descriptive extremes of a symptom. The patients marked the position on the
line corresponding to their own feeling about the symptom. The results from
all ten indices were then summed to give an overall value. There was a high
correlation (0.87) between scores obtained in the presence of a doctor and
those 24 hours later, and significant changes in LASA scores were observed
to correspond with clinical impressions.

We used thirty-one scales to cover twenty-eight of the patient’s general 
health and specific disease-related items and one scale as a global assessment 
of overall lifestyle. These issues have been discussed elsewhere (Selby et al.,
1984) and attention will be focused here on statistical points.


2. DATA 

The study population was 115 breast cancer cases in clinics at the 
Princess Margaret, a cancer treatment hospital in Toronto. The cases had 
recurrent disease or were receiving adjuvant therapy. We wanted patients to 
assess the effects of disease or treatment on their lives so they were instructed 
to consider their lives relative to what would be normal for them.

e.g. Physical Activity

    completely unable                                     normal physical
    to move my body    |-----------------------------|    activity for me

    LASA score = # cm (maximum score = 10 cm)


The areas covered are listed in Table 1.

Table 1. Areas Covered by Linear Analogue Scales

General Health          Disease Related

Work                    Dysuria
Writing                 Pain
Concentration           Bowels
Recreation              Sore Mouth
Housework               Breathing
Social Life             Fatigue
Physical Activity       Nausea
Mobility                Vomiting
Self Care               Attractiveness
Family Relations        Hair Loss
Eating                  Appearance
Sleep                   Information (about disease)
Speech
Anger
Depression
Anxiety

Global

Uniscale: global assessment of overall quality of life

Our general health categories were derived from the Sickness Impact
Profile (SIP) of Bergner et al. (1981), so a group of 52 patients completed
both forms to permit comparison of the two approaches. SIP involves a
long list of questions, for each of which the patient checks presence or
absence. The responses to questions are then weighted and grouped to give the
patient’s status for particular areas of life; the overall index is derived directly
from the scores for the subcategories. The same group of patients had a
physician independently assess their status with linear analogue scales and
the Karnofsky performance scale.

Our disease-related items were obtained from clinical experience and
the opinions of patients with the disease. Three areas were given two scales:
sleeping less or more, eating less or more, and constipation/diarrhoea. These
were then combined to make three continuous variables for sleeping, eating,
and bowels.

Medical information was also available on the patient’s age, presence of
other syndrome, whether she had had a mastectomy, site of disease, and
times from diagnosis to LASA, recurrence to LASA, and diagnosis to
recurrence. These medical variables were considered because they might affect
responses about quality of life.

3. METHODOLOGY 

The properties of these scales were investigated from a pragmatic
viewpoint. The scales appeared to be acceptable for patients in that they
responded positively to taking part in the study 89% of the time, and the
form usually took only about 5 minutes to complete. It should be noted
that the methods and results reported here form a summary of several
related investigations on different subsets of data so that not all the patients
were used for each.

Reliability was examined in two different ways. Ninety-six patients
completed LASA in outpatient clinics and then 9-12 hours later completed it
again at home. It was not anticipated that a patient’s clinical status would
change in this time frame so product moment correlation coefficients were
calculated to examine the relationship between the two sets of scores. As
well, intraclass correlation coefficients were used for LASAs completed in
the clinic, at home 9-12 hours later, and a week after their first LASA.
The product moment and intraclass correlation coefficients were repeated
without the group of people who had fairly normal lives, with scores > 9.7.
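
The two reliability measures described above can be sketched as follows. The text does not say which form of intraclass correlation was used, so the one-way random-effects form is an assumption here, and the four pairs of scores are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
from math import sqrt


def product_moment(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def intraclass_one_way(rows):
    """One-way random-effects intraclass correlation (an assumed form).

    rows[i] holds the k repeated scores recorded for patient i.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    means = [sum(r) / k for r in rows]
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2 for r, m in zip(rows, means) for v in r
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical LASA scores (cm) for four patients: clinic first, then home.
clinic = [2.0, 4.0, 6.0, 8.0]
home = [2.5, 3.5, 6.5, 7.5]

r = product_moment(clinic, home)
icc = intraclass_one_way([list(pair) for pair in zip(clinic, home)])
print(round(r, 3), round(icc, 3))
```

Unlike the product moment coefficient, the intraclass coefficient is lowered by a systematic shift between the two occasions, which may be one reason for reporting both.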

We looked at the validity of the results with 52 patients. Product
moment correlation coefficients were used to compare the patient’s LASA scores
with those linear analogues given by the physician (LAPA) as well as to
compare areas in common with SIP and the Karnofsky performance
status. Meanwhile, the LASA scores for metastatic and adjuvant chemotherapy
patients were compared to see if changes in quality of life associated with
treatment or the progression of disease could be inferred from changes in
LASA scores.

Multivariate analyses were used in two contexts. Factor analysis was
employed to examine the relationships of the twenty-eight specific LASA
categories. Varimax and promax rotations were used to define five factors,
with each variable within a factor having loading greater than 0.3. The
patient’s sum of the variable values for each factor was used to calculate
product moment correlation coefficients for the two LASAs completed within 12
hours.

There is a medical interest in the relationship between the global
measure of quality of life, UNI, and individual variables. Priestman and Baum
(1976) gave uniform weights to their items while Bergner et al. (1981) used
the opinions of professionals, and factor analysis defines the association of
variables by quantity of impairment. We felt that none of these approaches
adequately handled the situation since in the first instance a woman might
not consider restrictions in some areas of life as important as in others, in
the second she might have different views from a professional on what was
important to her, and lastly even within a factor analysis grouping she might
consider some variables of greater or lesser relevance. Our UNI scores were
derived from a separate question about overall lifestyle so a simplistic
approach was chosen of performing a regression with UNI as the dependent
and the other LASAs as the independent variables. This amounts to an ad
hoc way of incorporating a patient’s view of item relevance.

The original data were very non-normal. Priestman and Baum (1976)
had used an arcsine transformation on their data. After trying several
transformations, we decided to use log{arccos(variable/10)} which incorporates
the aesthetic flipping of the unaffected LASA scores of 10 to 0. Forward
stepwise linear regressions were used on the LASA and LAPA scores. The
residuals were looked at with P-P plots as well as for any evidence of
non-linear relationships.
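
A minimal sketch of the transformation log{arccos(variable/10)} is given below. The function name is hypothetical, and the text does not state how a score of exactly 10 is treated once arccos has mapped it to 0, so the sketch simply leaves that case undefined; that choice is an assumption, not the authors' rule.

```python
from math import acos, log


def lasa_transform(score_cm):
    """log{arccos(score/10)} for a LASA score in cm on a 10 cm line."""
    if not 0.0 <= score_cm <= 10.0:
        raise ValueError("LASA scores lie between 0 and 10 cm")
    angle = acos(score_cm / 10.0)  # 10 (unaffected) maps to 0, 0 maps to pi/2
    if angle == 0.0:
        # Assumption: log(0) is left undefined for a score of exactly 10.
        return None
    return log(angle)


for score in (0.0, 2.5, 5.0, 7.5, 9.7, 10.0):
    print(score, lasa_transform(score))
```

The arccos step reverses the direction of the scale, so larger transformed values correspond to greater impairment.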

Sample size considerations can be deduced from comparison with similar
patients in other studies where various sample sizes were used to answer
different medical questions. Product moment correlation coefficients were
computed for samples of size 25, 60, and 96 to measure the repeatability of
the LASA scores, where the 96 patients are from the present data set. In
a social science context, absolute values of, say, > 0.70 are usually used as
measures of reliability, but we have considered a statistical criterion here of
significance levels for testing whether a correlation coefficient is greater than
0. The stability of regressions can be compared for samples of 60 and 75
cases, where the 75 cases are from the present study.
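
To see what the significance criterion demands at each sample size, the sketch below computes the smallest correlation that would be judged greater than 0. The text does not name the test, so Fisher's z approximation is an assumption here, as is the function name.

```python
from math import sqrt, tanh
from statistics import NormalDist


def critical_r(n, alpha):
    """Smallest sample correlation judged greater than 0 at one-sided level alpha.

    Assumes Fisher's z approximation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3) when the true correlation is 0.
    """
    z = NormalDist().inv_cdf(1.0 - alpha)
    return tanh(z / sqrt(n - 3))


for n in (25, 60, 96):
    print(n, round(critical_r(n, 0.05), 3), round(critical_r(n, 0.01), 3))
```

Under this approximation the threshold for 96 cases at the 1% level is roughly 0.24, well below the 0.70 convention mentioned above, so the statistical criterion is the less demanding of the two.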

4. RESULTS 

Table 2 shows the distribution of correlation coefficients for the
comparison of data from outpatient clinics and values 9-12 hours later. All
the correlations were significantly greater than 0 at the 1% level. The
intraclass correlations yielded similar results except for nausea and vomiting
which were very close to zero; it should be noted that these were two of the
three variables with product-moment correlations less than 0.60. Restricting
the cases considered to those LASA values < 9.7 did not affect the results
greatly.

The LASA and LAPA scores are compared in Table 3 while the LASA
scores are compared with SIP and Karnofsky in Table 4. There is no real
evidence of disagreement between the LASA scores and the other indicators
of patient status. Substantial differences were also noticed between LASA
scores for metastatic and adjuvant chemotherapy patients.

Table 2. Reliability of Linear Analogue Scales

Ranges for
Correlation
Coefficients      # Variables

[0.80,1.00]           11
[0.70,0.80)           13
[0.60,0.70)            5
[0.00,0.60)            3
                      --
                      32

Table 3. Comparison of Patient and Physician Scores

Ranges for
Correlation
Coefficients      # Variables

[0.80,1.00]            5
[0.70,0.80)           13
[0.60,0.70)            6
[0.00,0.60)            8
                      --
                      32

The five factors defined by factor analysis are given in Table 5. In broad
terms the factors may be described as covering the areas of activity level
and mental alertness, physical status, emotional, digestive, and physical
attributes. Table 6 contains the correlations for these factors between repeated
measures of the LASA 9-12 hours apart. All the factors except for the
digestive one, composed of nausea, vomiting, and eating, had good repeatability
with correlations over 0.80.

Stable regressions were obtained with around 75 cases; the addition or
removal of about 10 cases had only a minor effect on the results. The
final model could be considered as including physical activity, anxiety, social
life, appearance, depression, housework, speech, and diagnosis to recurrence
time, and accounted for 70-84% of the variation. The LAPA scores were
more precise than the LASA scores with stable regressions with about 45
cases, but only physical activity was in common with the LASA model (80%
of the variation was accounted for by the LAPA model). There was no
evidence of substantial departures from normality in the P-P plots after the
regression, nor any indication in the other residual plots that the linear model
is inadequate. One concern arose from the regressions; when the same
individuals were involved in several repeated LASAs, the intervariable
correlations increased markedly with the number of times that the LASA was
repeated.

Table 4. Comparison of Linear Analogue Scales
With Other Measures of Patient Status

With Sickness Impact Profile:

Ranges for
Correlation
Coefficients      # Variables

[0.80,1.00]            1
[0.70,0.80)            3
[0.60,0.70)            2
[0.00,0.60)            1
                      --
                       7

Common areas covered: overall score, mobility, housework, work, recreation,
eating less, concentration vs. alertness behaviour.

With Karnofsky Performance Status:

Correlation coefficient with LASA UNI = 0.62

Table 5. Factors Defined by Factor Analysis¹

Factor 1: Mobility, Physical Activity, Recreation, Fatigue,
          Social Life, Housework, Writing, Concentration
Factor 2: Pain, Physical Activity, Bowel Habit, Breathing
Factor 3: Depression, Anger, Anxiety, Appearance,
          Concentration, Family Relations
Factor 4: Nausea, Vomiting, Eating
Factor 5: Attractiveness, Family Relationships, Hair Loss

¹ All variables had factor loadings greater than 0.3.

Table 6. Comparison of Factors for Repeated
Linear Analogues 9-12 Hours Apart

Factor      Correlation Coefficient

  1               0.86
  2               0.86
  3               0.81
  4               0.42
  5               0.88

Table 7 contains the results of sample size observations. With 96 cases, 
all the correlation coefficients were significantly greater than 0 at the 1% 
level while with around 75 cases there appeared to be quite stable regression 
results. 


Table 7. Sample Size Observations

RELIABILITY:

# Cases   Observation

  96¹     all the correlation coefficients were significantly
          greater than 0 at the 1% level
  60      5 variables had correlation coefficients not
          significantly greater than 0 at the 5% level
  25      9 variables had correlation coefficients not
          significantly greater than 0 at the 5% level

REGRESSIONS:

# Cases   Observation

  75¹     in 4 similar regressions 6 of 9 ‘final’ subset
          variables are the same, 2 other variables are included
          in 3 of the final models
  60      only 4 of 9 or 10 ‘final’ subset variables are the
          same in various regressions

¹ Cases are subsets of the 115 being looked at here; all others are from other
studies.

5. DISCUSSION

Linear analogue scales are simple to use and accepted readily by patients 
in a clinic setting. It appears that they produce quite repeatable results in
directions that are supported by other measures of patient status.

The scales also yielded data sets with fairly stable multivariate models.
It is quite reasonable that the digestive factor from factor analysis might
have much lower repeatability than the others. It is also reasonable that the
physician scores would be more precise given that they lack the inter-person
variability of the LASA scores and were completed by a healthy individual.
The physician, though, would not likely have known the true impact of disease
or treatment on a patient’s whole lifestyle; this could explain the overlap
of the observable physical activity in the patient and physician regression
models. It seems important from a medical point of view that a patient’s
opinions of relevance of scale items be considered rather than the
alternatives of uniform weighting, weighting by nondiseased experts or weighting
by statistical association of impairment. Some caution would be advisable if
many repeated measures are planned to follow treatment or disease
progression as there was evidence that the dimensions may be collapsing. Further
investigation is required to ascertain better the nature of this.

The linear analogues could be used to evaluate the patient needs on 
a group or individual basis with a view to providing assistance, especially 
with very toxic regimens. They could also be used to follow patient status 
and needs with time as disease progresses or with treatment as suggested 
by Priestman and Baum (1976). However, caution would be urged, at least at
present, regarding this latter use of the LASA scales discussed here, pending
further clarification of the effects of repeated measures.

TESTS FOR TREND IN BINOMIAL PROPORTIONS
WITH HISTORICAL CONTROLS: 

A PROPOSED TWO-STAGE PROCEDURE 

ABSTRACT 

Historical control information may be used to strengthen inferences made 
concerning the carcinogenic potential of substances subjected to long term 
bioassay. Previously proposed tests which incorporate historical control
information assume at least tacitly that the concurrent and historical control
data arise from the same distribution. In this paper, we introduce a two- 
stage procedure which utilizes historical data only if the response rate in the 
concurrent control group falls within a suitable tolerance interval defined in 
terms of the historical control series. The asymptotic distribution theory 
required to use the procedure is presented and applied to an actual set of 
bioassay data for which historical control information is available. 

1. INTRODUCTION 

Long term bioassays with small rodents continue to be widely used to 
assess the carcinogenic potential of chemical substances (Bickis and Krewski, 
1985a). These studies generally involve the administration of different levels 
of the test substance to groups of animals throughout the major portion of 
their lifespan, along with a similar group of unexposed animals comprising 
the concurrent controls. At the end of the study, the data are examined for
the presence of a dose related increase in the occurrence of tumours among
the exposed animals relative to controls.

1 Health Protection Branch, Health and Welfare Canada, Ottawa, Ontario K1A
2 Department of Mathematics and Statistics, Carleton University, Ottawa,
Ontario K1S 5B6
3 Department of Statistics, George Washington University, Washington, D.C.

The use of historical control data in conjunction with the concurrent 
control data as a means of strengthening statistical tests was first proposed 
in the bioassay context by Tarone (1982). Since that time, related procedures 
have been proposed by Dempster et al. (1983), Hoel (1983), Krewski et al. 
(1985) and Smythe et al. (1986). A potential limitation shared by all of 
these procedures is their presumption that the concurrent and historical 
control data arise from the same distribution. Because of this, we propose 
to consider two-stage tests in which the historical control data is employed 
only if it appears to be consistent with the concurrent control results. 

In Section 2, the Cochran-Armitage test for increasing trend in binomial 
proportions commonly applied in the absence of historical control data is 
reviewed. The incorporation of historical control information into this
procedure in the spirit of Tarone (1982) and Krewski et al. (1985) is discussed
in Section 3. The proposed two-stage procedure is then developed in Section 
4. This procedure involves the construction of a suitable tolerance interval 
based on the historical control series, with the historical data used only if 
the response rate in the concurrent control group falls within this interval. 
(See Hoel and Yanagawa, 1986, for a related approach in the finite-sample 
case.) The application of the proposed procedure is illustrated in Section 5 
using an actual set of bioassay data for which historical control information 
is also available. 


2. TESTS FOR TREND IN BINOMIAL PROPORTIONS 

Consider an experiment with $k + 1$ dose levels $0 = d_0 < d_1 < \cdots < d_k$
in which $x_i$ of the $n_i$ animals at dose $d_i$ respond ($i = 0, 1, \ldots, k$). We
assume that $x_i$ follows a binomial distribution $B(n_i, p_i)$ where the response
probability $p_i = P(d_i)$, with $P(d)$ satisfying the logistic dose response model

$$P(d) = [1 + \exp\{-(a + bd)\}]^{-1} \tag{2.1}$$

($-\infty < a, b < +\infty$) for $d \ge 0$. The likelihood function for the unknown
parameters $a$ and $b$ given the data $x = (x_0, x_1, \ldots, x_k)$ is

$$L(a, b \mid x) = \prod_{i=0}^{k} \binom{n_i}{x_i} p_i^{x_i} (1 - p_i)^{n_i - x_i}. \tag{2.2}$$

Treating $a$ as a nuisance parameter, the score statistic for testing the null
hypothesis $H_0 : b = 0$ versus the one-sided alternative $H_1 : b > 0$ is given by

$$T_{CA} = \left.\frac{\partial \log L}{\partial b}\right|_{a = \hat{a},\, b = 0} = \sum x_i d_i - \hat{p} \sum n_i d_i, \tag{2.3}$$

where $\hat{p} = x/n$ is a consistent estimator of $p = p_0$ with $x = \sum x_i$ and
$n = \sum n_i$, and $\hat{a} = -\log((1 - \hat{p})/\hat{p})$ is the maximum likelihood estimator of
$a$ under $H_0 : b = 0$ (Tarone and Gart, 1980). The variance of this statistic
is

$$n^{-1} V(T_{CA}) = n^{-1} p(1 - p)\left\{\sum n_i d_i^2 - n^{-1}\left(\sum n_i d_i\right)^2\right\} \sim p(1 - p)\,\sigma_d^2. \tag{2.4}$$

Here, $\sigma_d^2 = \sum \lambda_i (d_i - \bar{d})^2$ with $\bar{d} = \sum \lambda_i d_i$ and $\sim$ denotes asymptotic
equivalence as $n \to \infty$ with $n_i/n \to \lambda_i > 0$.

The standardized test statistic $S_{CA} = T_{CA}/[V_{\hat{p}}(T_{CA})]^{1/2}$ is commonly
called the Cochran-Armitage test statistic after Cochran (1954) and
Armitage (1955). Because the $x_i$ are independent random variables, it follows
from standard central limit arguments that

$$\lim_{n \to \infty} \Pr\{S_{CA} \le x\} = \Phi(x), \tag{2.5}$$

where $\Phi$ denotes the standard normal cumulative distribution function.
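
The calculation in (2.3) to (2.5) can be sketched in a few lines. The function name is hypothetical; the counts are those of the current experiment reported in Table 1 of Section 5, with doses coded 0, 1, 2, and the variance is evaluated at the estimated common response rate x/n.

```python
from math import sqrt
from statistics import NormalDist


def cochran_armitage(doses, responders, group_sizes):
    """Return (S_CA, one-sided p-value) following (2.3)-(2.5)."""
    n = sum(group_sizes)
    p_hat = sum(responders) / n
    sum_xd = sum(x * d for x, d in zip(responders, doses))
    sum_nd = sum(m * d for m, d in zip(group_sizes, doses))
    sum_nd2 = sum(m * d * d for m, d in zip(group_sizes, doses))
    t_ca = sum_xd - p_hat * sum_nd                                # (2.3)
    variance = p_hat * (1 - p_hat) * (sum_nd2 - sum_nd ** 2 / n)  # (2.4) at p_hat
    s_ca = t_ca / sqrt(variance)
    return s_ca, 1.0 - NormalDist().cdf(s_ca)                     # (2.5)


# Current-experiment counts from Table 1 of Section 5.
s_ca, p_value = cochran_armitage([0, 1, 2], [6, 5, 10], [54, 54, 54])
print(round(s_ca, 2), round(p_value, 3))  # prints 1.15 0.126, as in Table 2
```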


3. INCORPORATION OF HISTORICAL CONTROLS 


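To incorporate historical controls, we suppose that $p = [1 + \exp(-a)]^{-1}$
varies from experiment to experiment according to a beta distribution with
density

$$f(p \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha - 1}(1 - p)^{\beta - 1} \tag{3.1}$$

for $0 < p < 1$ ($\alpha, \beta > 0$). Then $a = -\log((1 - p)/p)$ has density

$$g(a \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \left(\frac{1}{1 + e^{-a}}\right)^{\alpha} \left(\frac{e^{-a}}{1 + e^{-a}}\right)^{\beta}. \tag{3.2}$$

The marginal likelihood for $b$ is then given by

$$L(b \mid x) = \int_{-\infty}^{\infty} L(a, b \mid x)\, g(a \mid \alpha, \beta)\, da. \tag{3.3}$$

D. KREWSKI, R. T. SMYTHE AND D. COLIN

The score statistic for testing $H_0 : b = 0$ is now given by

$$T = \left.\frac{\partial \log L(b \mid x)}{\partial b}\right|_{b = 0} = \sum x_i d_i - \tilde{p} \sum n_i d_i, \tag{3.4}$$

where $\tilde{p} = (x + \alpha)/(n + \alpha + \beta)$ (Krewski et al., 1985). Under $H_0$, $E(T) = 0$
and

$$n^{-1} V(T) = n^{-1} \frac{\alpha\beta}{(\alpha + \beta)(\alpha + \beta + 1)} \left\{\sum n_i d_i^2 - \frac{1}{(n + \alpha + \beta)}\left(\sum n_i d_i\right)^2\right\} \sim \frac{\alpha\beta}{(\alpha + \beta)(\alpha + \beta + 1)}\, \sigma_d^2. \tag{3.5}$$

The limiting null distribution of $S = T/[V(T)]^{1/2}$ is given by

$$\lim_{n \to \infty} \Pr\{S \le x \mid \alpha, \beta\} = E_p\, \Phi(x/\tau_p), \tag{3.6}$$

where $\tau_p^2 = p(1 - p)(\alpha\beta)^{-1}(\alpha + \beta)(\alpha + \beta + 1)$. The asymptotic distribution
of $S$ is thus a mixture of normal distributions with mean zero and variance
$E(\tau_p^2) = 1$. For applications, the critical value of $S$ for testing $H_0 : b = 0$ is
well approximated by that based on the standard normal distribution (see
Krewski et al., 1985, for further details).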
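
A minimal sketch of the one-stage statistic follows, assuming the forms of (3.4) and (3.5): the pooled rate x/n is replaced by the shrunken estimate (x + α)/(n + α + β), and the variance uses the beta moments. The function name is hypothetical; the counts and the values α = 14.2, β = 160.0 are those reported in Table 1 of Section 5.

```python
from math import sqrt
from statistics import NormalDist


def trend_with_historical(doses, responders, group_sizes, alpha, beta):
    """Return (S, p-value from the standard normal approximation)."""
    n = sum(group_sizes)
    x = sum(responders)
    sum_xd = sum(r * d for r, d in zip(responders, doses))
    sum_nd = sum(m * d for m, d in zip(group_sizes, doses))
    sum_nd2 = sum(m * d * d for m, d in zip(group_sizes, doses))
    p_tilde = (x + alpha) / (n + alpha + beta)
    t = sum_xd - p_tilde * sum_nd                                  # (3.4)
    scale = alpha * beta / ((alpha + beta) * (alpha + beta + 1))
    variance = scale * (sum_nd2 - sum_nd ** 2 / (n + alpha + beta))  # (3.5)
    s = t / sqrt(variance)
    return s, 1.0 - NormalDist().cdf(s)


s, p_value = trend_with_historical(
    [0, 1, 2], [6, 5, 10], [54, 54, 54], alpha=14.2, beta=160.0
)
print(round(s, 2), round(p_value, 3))  # prints 2.13 0.017, as in Table 2
```

The statistic is larger than the Cochran-Armitage value because the historical series pulls the estimated control rate down from about 0.13 to about 0.10.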

The test for trend using historical control data based on the statistic
$S$ may be expected to perform well when the historical data and the
concurrent controls arise from the same distribution. Because of systematic
differences between experiments, however, it is possible that the concurrent
control response probability may follow a beta distribution with parameters
$(\alpha^*, \beta^*) \ne (\alpha, \beta)$. In particular, if $\alpha^*/(\alpha^* + \beta^*) > \alpha/(\alpha + \beta)$, the value of $T$
in (3.4) will tend to be inflated thereby leading to an excessive Type I error
rate.


4. A TWO-STAGE TEST WITH HISTORICAL CONTROLS 

In order to protect against inflation of the Type I error rate in cases where
the historical and concurrent controls do not follow the same distribution,
we propose the use of a two-stage procedure in which the historical control
data is used only if it appears to be consistent with the concurrent control
data. Specifically, let $I_{n_0} = [k_1(n_0), k_2(n_0)]$ be a $100\gamma\%$ tolerance interval
for $x_0$ satisfying

$$\sum_{x_0 = k_1(n_0)}^{k_2(n_0)} \Pr(x_0) \ge \gamma \tag{4.1}$$

($0 < \gamma < 1$), where the marginal distribution of $x_0$ is specified by

$$\Pr(x_0) = \binom{n_0}{x_0} \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \frac{\Gamma(x_0 + \alpha)\,\Gamma(n_0 - x_0 + \beta)}{\Gamma(n_0 + \alpha + \beta)} \tag{4.2}$$

($x_0 = 0, 1, \ldots, n_0$). The test statistic $T^*$ is then selected as

$$T^* = \begin{cases} T & \text{if } x_0 \in I_{n_0}, \\ T_{CA} & \text{otherwise.} \end{cases} \tag{4.3}$$

The mean and variance of $x_0$ based on (4.2) are given by

$$E(x_0) = n_0 \theta \tag{4.4}$$

and

$$V(x_0) = n_0\{\theta(1 - \theta) + (n_0 - 1)\Delta\}, \tag{4.5}$$

where $\theta = \alpha/(\alpha + \beta)$ and $\Delta = \alpha\beta(\alpha + \beta)^{-2}(\alpha + \beta + 1)^{-1}$. The asymptotic
distribution of $z = [x_0 - E(x_0)]/\{V(x_0)\}^{1/2}$ is given by

$$\lim_{n_0 \to \infty} \Pr\{z \le y\} = \begin{cases} 1 & \text{if } y \ge (1 - \theta)\Delta^{-1/2}, \\ F(\theta + \Delta^{1/2} y) & \text{if } -\theta\Delta^{-1/2} < y < (1 - \theta)\Delta^{-1/2}, \\ 0 & \text{if } y \le -\theta\Delta^{-1/2}, \end{cases} \tag{4.6}$$

where $F$ is the cumulative distribution of the beta density in (3.1). The
limiting density of $z$ is thus

$$\phi(z) = \Delta^{1/2} f(\theta + \Delta^{1/2} z). \tag{4.7}$$

In large samples, condition (4.1) may now be expressed as

$$\int_{c_1}^{c_2} \phi(z)\, dz = \gamma, \tag{4.8}$$

where $c_1 < 0$ and $c_2 > 0$ may be uniquely defined by setting

$$\int_{-\theta\Delta^{-1/2}}^{c_1} \phi(z)\, dz = \int_{c_2}^{(1 - \theta)\Delta^{-1/2}} \phi(z)\, dz = (1 - \gamma)/2. \tag{4.9}$$

The interval $I_{n_0} = [k_1(n_0), k_2(n_0)]$ is then approximated by
$k_i \sim n_0(\theta + c_i \Delta^{1/2})$.

To obtain an approximate critical region for the two-stage procedure, we
note that under the null hypothesis,

$$\Pr\{\text{reject } H_0\} = \Pr\{\text{reject } H_0 \mid T^* = T\}\Pr\{T^* = T\} + \Pr\{\text{reject } H_0 \mid T^* = T_{CA}\}\Pr\{T^* = T_{CA}\}. \tag{4.10}$$

Under the assumption that $(\alpha^*, \beta^*) = (\alpha, \beta)$, $\Pr\{T^* = T\} = \gamma$ and
$\Pr\{T^* = T_{CA}\} = 1 - \gamma$. Thus, in order to obtain an overall Type I error rate of
$\delta$ ($0 < \delta < 1$), the Type I error rates given $T^* = T$ and given $T^* = T_{CA}$
may both be set equal to $\delta$.

Using arguments similar to those given by Krewski et al. (1985), the
conditional probability in the first term on the right hand side of (4.10) may
be approximated by

$$\Pr\{S > s_0 \mid k_1(n_0) \le x_0 \le k_2(n_0)\} \simeq \gamma^{-1} \int_{\theta + c_1\Delta^{1/2}}^{\theta + c_2\Delta^{1/2}} \Phi\left(\frac{-s_0}{\tau_p}\right) f(p)\, dp, \tag{4.11}$$

where $\tau_p^2$ is defined in (3.6) and $f$ is the beta density in (3.1). The critical
value $s_0$ for $S$ conditional on $x_0 \in I_{n_0}$ is thus obtained by setting

$$\gamma^{-1} \int_{\theta + c_1\Delta^{1/2}}^{\theta + c_2\Delta^{1/2}} \Phi(-s_0/\tau_p)\, f(p)\, dp = \delta. \tag{4.12}$$
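
The integral in (4.11) is easy to evaluate numerically. The sketch below is one way to do so under stated assumptions: the interval for p and the value of the statistic are taken from the γ = 0.90 row of Table 2, α and β are the values quoted in Section 5, Simpson's rule stands in for whatever quadrature the authors used, and the integral is normalised by the numerically computed beta mass of the interval, which plays the role of γ. All function names are hypothetical.

```python
from math import exp, lgamma, log, sqrt
from statistics import NormalDist

ALPHA, BETA = 14.2, 160.0  # historical-control beta parameters (Section 5)
LOG_NORM = lgamma(ALPHA + BETA) - lgamma(ALPHA) - lgamma(BETA)
PHI = NormalDist().cdf


def beta_density(p):
    """Beta density (3.1), evaluated on the log scale."""
    return exp(LOG_NORM + (ALPHA - 1) * log(p) + (BETA - 1) * log(1 - p))


def tau(p):
    """tau_p as defined after (3.6)."""
    return sqrt(p * (1 - p) * (ALPHA + BETA) * (ALPHA + BETA + 1) / (ALPHA * BETA))


def simpson(func, lower, upper, steps=2000):
    """Composite Simpson rule; steps must be even."""
    h = (upper - lower) / steps
    total = func(lower) + func(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * func(lower + i * h)
    return total * h / 3


def conditional_tail(s, lower, upper):
    """Approximate Pr{S > s | lower <= p <= upper}, in the spirit of (4.11)."""
    numerator = simpson(lambda p: PHI(-s / tau(p)) * beta_density(p), lower, upper)
    mass = simpson(beta_density, lower, upper)  # plays the role of gamma
    return numerator / mass


# Interval and statistic from the gamma = 0.90 row of Table 2.
print(round(conditional_tail(2.13, 0.047, 0.123), 3))
```

The printed value can be compared with the conditional p-value reported for γ = 0.90 in Table 2; exact agreement should not be expected, since the normalisation used here is an assumption.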

Similarly, the critical values for $S_{CA}$ conditional on $x_0 \notin I_{n_0}$ are determined
by setting

$$\Pr\{S_{CA} > s_1 \mid x_0 < k_1(n_0)\} \simeq \Phi(-s_1) \tag{4.13}$$

and

$$\Pr\{S_{CA} > s_2 \mid x_0 > k_2(n_0)\} \simeq \Phi(-s_2) \tag{4.14}$$

equal to $\delta$.
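
As a numerical check on the first-stage distribution, the beta-binomial probabilities (4.2) can be summed directly and their mean and variance compared with (4.4) and (4.5). The sketch below assumes the concurrent control group size of 54 and the values α = 14.2, β = 160.0 quoted in Section 5; the function name is hypothetical.

```python
from math import exp, lgamma


def beta_binomial_pmf(x0, n0, alpha, beta):
    """Pr(x0) from (4.2), evaluated on the log scale for stability."""
    log_choose = lgamma(n0 + 1) - lgamma(x0 + 1) - lgamma(n0 - x0 + 1)
    log_beta_ratio = (
        lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
        + lgamma(x0 + alpha) + lgamma(n0 - x0 + beta)
        - lgamma(n0 + alpha + beta)
    )
    return exp(log_choose + log_beta_ratio)


n0, alpha, beta = 54, 14.2, 160.0  # values quoted in Section 5
pmf = [beta_binomial_pmf(x0, n0, alpha, beta) for x0 in range(n0 + 1)]

mean = sum(x0 * pr for x0, pr in enumerate(pmf))
variance = sum((x0 - mean) ** 2 * pr for x0, pr in enumerate(pmf))

theta = alpha / (alpha + beta)
delta = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
expected_variance = n0 * (theta * (1 - theta) + (n0 - 1) * delta)

print(mean, n0 * theta)             # both sides of (4.4)
print(variance, expected_variance)  # both sides of (4.5)
```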



5. AN EXAMPLE 

As an example of the application of the proposed procedure, consider the 
data given in Table 1 on the occurrence of alveolar-bronchiolar adenomas in 
mice taken from the data base examined by Bickis and Krewski (1985b). 
As shown in Table 2, the Cochran-Armitage test applied without the use
of historical controls does not lead to a significant result at the nominal
5% level of significance. Application of the single-stage test with historical
controls, on the other hand, yields a p-value of 0.017. Because the response
rate 6/54 = 0.11 in the concurrent control exceeds the mean response rate
$\alpha/(\alpha + \beta) = 0.08$ in the historical controls, however, it is not clear which of
these two tests is more appropriate.

The results of the two-stage test shown in Table 2 indicate that with $\gamma$ =
0.75 or 0.90, the historical control data is used, with the p-values conditional
on $x_0 \in I_{n_0}$ being only trivially larger than the unconditional p-value of 0.017.
With $\gamma = 0.60$, however, $x_0/n_0 = 0.11 > 0.105 = k_2/n_0$ and the
Cochran-Armitage statistic is used.

Table 1. Occurrence of Alveolar-Bronchiolar Adenomas
in Mice Following 102 Weeks on Test.

(no. with tumours/no. at risk)

Historical Controls (a)

0/20   0/12   1/19   2/25   1/10   8/49
0/20   0/12   1/19   4/47   2/20   3/18
0/19   0/10   1/17   2/22   3/20   4/20
0/17   1/20   1/15   2/20   3/20

Current Experiment

Dose:        0      1      2
x_i/n_i:    6/54   5/54   10/54

(a) α = 14.2, β = 160.0


6. CONCLUSIONS

The use of historical control information can appreciably increase the 
sensitivity of tests for increasing trend in tumour occurrence with increasing 
dose. When the underlying response rate in the concurrent control group is 
substantially greater than the mean response rate in the historical controls, 
however, the Type I error rate of this procedure may be in excess of the 
nominal level. Because of this, a two-stage procedure is proposed in which 
the historical control data is used only if the concurrent control response rate 
falls within a suitable tolerance interval defined in terms of the historical 
control distribution. 

Before this two-stage procedure can be recommended for routine use 
in practice, further study of its properties is required. In particular, the 
degree to which the Type I error rate of the single-stage test with historical 
controls may be inflated is currently under investigation. Also under study 
is the selection of a suitable tolerance level $\gamma$ for use in defining the tolerance
interval to be used in the first stage of the two-stage procedure.

Table 2. Tests for Trend With and Without Historical Controls

Historical                                                   Test
Controls    Test Procedure                                   Statistic            p-value

No          Cochran-Armitage                                 S_CA = 1.15          0.126
Yes         One-Stage                                        S = 2.13             0.017
Yes         Two-Stage
              γ = 0.90; x_0/n_0 ∈ I = [0.047, 0.123]         S* = S = 2.13        0.018
              γ = 0.75; x_0/n_0 ∈ I = [0.054, 0.111]         S* = S = 2.13        0.019
              γ = 0.60; x_0/n_0 ∉ I = [0.059, 0.1052]        S* = S_CA = 1.15     0.126

POINT ESTIMATION OF THE ODDS RATIO
IN SPARSE 2 × 2 CONTINGENCY TABLES

ABSTRACT 

The performance of a number of alternative point estimators of a single 
odds ratio and its logarithm is evaluated in defined sets of 2 × 2 contingency
tables with small cell frequencies. The estimators considered are: four
versions of the unconditional maximum likelihood estimators, modified to be
defined when zero cells occur in the table; the conditional maximum
likelihood estimator, similarly modified; and two pseudo-count or “shrinkage”
estimators, proposed by Fienberg and Holland (1970). 

The sets of contingency tables are generated by considering all of the 
possible outcome frequencies in the table which are consistent with a single 
fixed margin; the probability of each table in the set is derived as a product 
of binomial probabilities associated with the realisations in each row of the 
table, corresponding to the table having arisen through the drawing of two 
independent binomial samples with known probabilities of “success” in each. 
The mean, variance, bias, mean squared error and average absolute error of 
each of the estimators are obtained and compared. Contrary to previous 
simulation work in 2 × 2 × k tables, it is found that the unconditional MLE
is usually preferred to the conditional MLE in single 2 × 2 tables. It is
possible that this finding is due in part to the modifications used to keep
the estimators defined when zero cell frequencies occur.


1. INTRODUCTION 

We will consider the estimation of the odds ratio in a single 2 × 2
contingency table, denoted as follows:

            # of Successes   # of Failures   Sample Size

Sample 1          a                b              n_1
Sample 2          c                d              n_2

We will suppose that such a table has been generated by drawing two
independent binomial samples of sizes $n_1$ and $n_2$, with probabilities of “success”
on each sample of $p_1$ and $p_2$ respectively.

The odds ratio in such a table is defined as $p_1(1 - p_2)/[p_2(1 - p_1)]$, and
will be denoted by $\rho$. We will be concerned with the estimation of $\rho$, and
also of its logarithm, $\beta = \ln(\rho)$. Previous literature suggests that there may
be some advantage in working with $\beta$ rather than $\rho$, because it tends to have
a more symmetric and approximately normal sampling distribution; $\rho$ has
a range of 0 to $\infty$, with a null value of 1, whereas $\beta$ has a range of $-\infty$ to
$\infty$, with a null value of zero.
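
The sampling model just described can be enumerated exactly for small samples. The sketch below lists every table consistent with fixed sample sizes, attaches the product of the two binomial probabilities to each, and adds up the probability that at least one cell is zero, which is where an unmodified ratio estimator breaks down. The sample sizes and success probabilities are hypothetical, as are the function names.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def odds_ratio(p1, p2):
    """p1(1 - p2) / [p2(1 - p1)]."""
    return p1 * (1 - p2) / (p2 * (1 - p1))


def enumerate_tables(n1, n2, p1, p2):
    """Yield ((a, b, c, d), probability) for every table with row sizes n1, n2."""
    for a in range(n1 + 1):
        for c in range(n2 + 1):
            prob = binomial_pmf(a, n1, p1) * binomial_pmf(c, n2, p2)
            yield (a, n1 - a, c, n2 - c), prob


# Hypothetical design: two samples of five, success probabilities 0.5 and 0.25.
tables = list(enumerate_tables(5, 5, 0.5, 0.25))
total = sum(prob for _, prob in tables)
zero_cell = sum(prob for cells, prob in tables if 0 in cells)

print(len(tables), odds_ratio(0.5, 0.25), round(total, 6), round(zero_cell, 4))
```

With samples this small, more than a quarter of the probability falls on tables with a zero cell, which illustrates why the treatment of such tables can matter for the comparison of estimators.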

The odds ratio has been widely used as a measure of association in 
contingency tables, enjoying a number of advantages of estimability and
interpretability not possessed by other indices. Indeed, the odds ratio is the
fundamental parameter underlying the analysis of multidimensional
contingency tables by log-linear models. The odds ratio is used frequently in
epidemiology and other medical research, perhaps because it may be
estimated in each of three major study designs: the prospective cohort study,
the retrospective case-control study, and the cross-sectional survey. It has
also been pointed out that the odds ratio forms a useful approximation to 
the relative risk in retrospective studies when the incidence of the disease in 
question is rare. The coefficients estimated in a logistic regression involving 
continuous and/or discrete independent variables can also be interpreted as 
log odds ratios. 

Other useful properties of the odds ratio include:

(i) its value does not depend on which way round the variables are assigned 
to the rows and columns of the table; this makes it attractive when the 
causal directionality of the association is not well defined, and 

(ii) an exchange of the rows or columns of the table results simply in an 
inversion of the odds ratio; this is because the odds ratio has a similar 
functional dependence on both the row and column margins. 
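
Both properties can be checked directly when the odds ratio is written in cross-product form, ad/(bc), for the cell entries a, b, c, d of the table above. The counts below are hypothetical and the helper names are illustrative only; exact fractions are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction


def cross_product_ratio(table):
    """ad / (bc) for a table given as ((a, b), (c, d))."""
    (a, b), (c, d) = table
    return Fraction(a * d, b * c)


def transpose(table):
    """Interchange the roles of the row and column variables."""
    (a, b), (c, d) = table
    return ((a, c), (b, d))


def swap_rows(table):
    top, bottom = table
    return (bottom, top)


def swap_columns(table):
    (a, b), (c, d) = table
    return ((b, a), (d, c))


table = ((12, 8), (5, 15))          # hypothetical counts
ratio = cross_product_ratio(table)  # 12 * 15 / (8 * 5) = 9/2

print(cross_product_ratio(transpose(table)) == ratio)         # property (i)
print(cross_product_ratio(swap_rows(table)) == 1 / ratio)     # property (ii)
print(cross_product_ratio(swap_columns(table)) == 1 / ratio)  # property (ii)
```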

The basic concepts of statistical variability in contingency table data are 
now widely appreciated by investigators in epidemiology and other branches 
of medical research. This is reflected by the almost uniform requirement of 
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biomedical journals to report statistical findings of this kind with suitable 
tests of significance, and (to a lesser extent) with confidence intervals for es¬ 
timated parameters, such as the odds ratio or relative risk. As methodologic 
support for these calculations, there now exists a considerable statistical lit¬ 
erature concerned with testing and estimation of odds ratios in a variety of 
situations, especially when an odds ratio is to be evaluated over a set of 2 x 2
tables, with each table containing the data from within a stratum defined 
by confounding or nuisance variables. A brief review of various methods for 
confidence interval construction is given by Fleiss (1979). 

Despite this, relatively little is known about the statistical properties of 
the various point estimators of the odds ratio which have been suggested. 
This is unfortunate, because indeed it is often the point estimate itself which 
is quoted as the single statistic which best summarises the data. For example, 
one might hear that the odds ratio (or relative risk) between smokers and 
non-smokers of contracting lung cancer is about 10, or that the relative 
risk of heart disease in a particular study is 1.5 in men as compared to 
women. Such statements are often expressed in the lay press, and to some 
extent by scientists, but they are frequently made without the benefit of 
associated confidence intervals or p-values. Indeed, great stress is often laid
on numerical comparisons between the relative risks reported in different
studies, again in purely numerical rather than statistical terms. 

For these reasons, it seems important to be able to provide an unbiased 
and precise point estimate of the odds ratio, even though such an estimate 
should be accompanied by appropriate tests and confidence intervals for 
thorough scientific reporting. It is the purpose of this paper to compare 
a number of alternative estimators, especially when the sample sizes are 
small, because it is in small samples where estimation problems are most 
acute. 


2. REVIEW OF AVAILABLE ESTIMATORS 

A simple estimator of $\rho$ is given by the empirical cross-product ratio in
the table: i.e.,

$$\hat{\rho}_{ML} = ad/(bc), \qquad (1)$$

where ML denotes maximum likelihood. The corresponding estimate of the
log odds ratio is defined by $\hat{\beta}_{ML} = \ln(\hat{\rho}_{ML})$. It should be noted that the
maximum likelihood estimate (MLE) of $p_1$ is $a/n_1$, and so by the invariance
principle the MLE of the odds $p_1/(1 - p_1)$ is $a/b$. Similarly the MLE of
$p_2/(1 - p_2)$ is $c/d$. Hence $\hat{\rho}_{ML}$ is also the MLE of $\rho$; in particular, we
will refer to this as the unconditional MLE (the UMLE), because the only
quantities regarded as fixed are the sample sizes $n_1$ and $n_2$. [It should be
noted that $\hat{\rho}_{ML}$ is equal to the Mantel-Haenszel estimate of the odds ratio
for single 2 x 2 tables.] In contrast, the conditional MLE (to be discussed
below) also fixes the other marginal totals, $(a + c)$ and $(b + d)$.
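
The bracketed remark on the Mantel-Haenszel estimate can be verified directly. Assuming the usual form of that estimator over strata $i = 1, \ldots, k$ with stratum totals $N_i$ (the stratum subscripts are introduced here for illustration only), the single-table case reduces to the cross-product ratio of equation (1):

```latex
\hat{\rho}_{MH} = \frac{\sum_{i=1}^{k} a_i d_i / N_i}{\sum_{i=1}^{k} b_i c_i / N_i}
\quad\Longrightarrow\quad
\hat{\rho}_{MH}\Big|_{k=1} = \frac{ad/N}{bc/N} = \frac{ad}{bc} .
```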

If $b$ or $c$ is zero in a particular sample, $\hat{\rho}_{ML}$ is undefined. Similarly,
if any of the cell frequencies is zero, $\hat{\beta}_{ML}$ is undefined. To avoid this
difficulty, it has been suggested that one add a positive constant $\varepsilon$ to each
cell, to create modified estimators of the form

$$\hat{\rho}_\varepsilon = (a + \varepsilon)(d + \varepsilon)/[(b + \varepsilon)(c + \varepsilon)] \qquad (2)$$

with, correspondingly, $\hat{\beta}_\varepsilon = \ln(\hat{\rho}_\varepsilon)$. Haldane (1956), commenting on a paper
by Woolf (1955), advocated using $\varepsilon = 1/2$, based on a Taylor series expansion
of $\ln(\hat{\rho}_\varepsilon)$ which shows that the first order bias term can be eliminated with
this choice. It is interesting to note that this order of bias elimination can
be achieved by using a value of $\varepsilon$ which does not depend on $p_1$ or $p_2$. It has
also been suggested that a similar modification be made to the expression
for the standard errors of $\hat{\rho}_\varepsilon$ and $\hat{\beta}_\varepsilon$, so that these too would be defined even
if a zero cell frequency occurred (Gart and Zweifel, 1967).
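
As an illustration of equation (2), the sketch below computes the corrected odds ratio and its logarithm in Python. The function names and the example table are hypothetical, chosen only to show that the corrected estimator remains defined when a cell is zero; the default correction of 1/2 is Haldane's choice.

```python
import math

def corrected_odds_ratio(a, b, c, d, eps=0.5):
    """Equation (2): add the constant eps to every cell before taking ad/(bc)."""
    return ((a + eps) * (d + eps)) / ((b + eps) * (c + eps))

def corrected_log_odds_ratio(a, b, c, d, eps=0.5):
    """Logarithm of the corrected odds ratio."""
    return math.log(corrected_odds_ratio(a, b, c, d, eps))

# Illustrative table with c = 0: the uncorrected ratio ad/(bc) is undefined,
# but the corrected version is (3.5 * 5.5) / (2.5 * 0.5) = 15.4.
print(corrected_odds_ratio(3, 2, 0, 5))
print(corrected_log_odds_ratio(3, 2, 0, 5))
```

Setting `eps=0.25` in the same functions gives the 1/4 correction discussed next.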

Hitchcock (1962) considered other values of $\varepsilon$, in the context of bioassay
analysis using a logistic model, when there are more than two dose levels
of the experimental stimulus. In particular, she found that using $\varepsilon = 1/2$
sometimes created more bias in the estimate of the logistic regression coefficient
(or equivalently in the log odds ratio) than if the correction was not
used at all. As a compromise, she suggested using $\varepsilon = 1/4$ when there are
three dose levels. This suggestion was taken up by Hauck et al. (1982), who
used this modification to define one of the estimators which they considered
in sets of 2 x 2 tables.

In the context of testing the significance of the association within a 2 x 2 
table, there has been considerable debate about whether it is necessary to 
condition on both margins of the table, as is done in the usual chi-square 
test. The row margins, $n_1$ and $n_2$, are essentially fixed by virtue of the binomial
sampling design. The reason usually advanced for also conditioning
on the second margin (the column totals) is that the margins do not contain
information on the association, for example on the difference between
the two outcome proportions $p_1$ and $p_2$; hence conditioning is a useful way
of eliminating the effect of the nuisance parameter, which in this situation 
could be taken as the overall probability of success in the combined sam¬ 
ples. In fact it has been shown that the margins do contain information 
on the association within the table. However, the amount of information 
thus conveyed is sufficiently small that conditioning on the second margin 
is generally considered to be appropriate. This issue has been considered 
at length by Yates (1984), and by discussants of his paper. Space does not
permit a full discussion of all the issues here, but suffice it to say that the
majority opinion appears to favour the conditional inference approach for 
significance testing in this context. 

Returning to estimation, the MLE of $\rho$ which does condition on the
second margin of the table is defined by setting the observed value of one
of the cell frequencies (say the “a” cell) equal to its expectation under the
non-central hypergeometric distribution, whose non-centrality parameter is
$\rho$; i.e.,

$$a = E(a \mid n_1, n_2, a + c, b + d, \rho) \qquad (3)$$

(see Breslow and Day, 1980; equation 4.5). Computation of this CMLE will
in general involve a polynomial of high order in $\rho$, whose numerical solution
would require an iterative approach. We should note that the CMLE is
undefined if a zero frequency occurs in any cell of the table.
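
The iterative solution of equation (3) can be sketched as follows. This is not the program used for the paper: it is a minimal Python illustration that assumes the standard non-central hypergeometric weights, proportional to C(n1, x) C(n2, m - x) rho^x with m = a + c, and solves for the log odds ratio by bisection. The function names, the bracket of -30 to 30 for the log odds ratio and the tolerance are all assumptions made for the example.

```python
from math import comb, exp, log

def conditional_mean(beta, n1, n2, m):
    """E(a | margins) under the non-central hypergeometric law, rho = exp(beta)."""
    xs = range(max(0, m - n2), min(n1, m) + 1)
    # Work with log weights to avoid overflow when beta or the counts are large.
    logw = [log(comb(n1, x)) + log(comb(n2, m - x)) + x * beta for x in xs]
    top = max(logw)
    w = [exp(v - top) for v in logw]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

def cmle(a, b, c, d, tol=1e-10):
    """Conditional MLE of the odds ratio; None when any cell is zero."""
    if 0 in (a, b, c, d):
        return None                      # a lies on the boundary: no finite root
    n1, n2, m = a + b, c + d, a + c
    lo, hi = -30.0, 30.0                 # assumed bracket for the log odds ratio
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The conditional mean increases with beta, so bisection applies.
        if conditional_mean(mid, n1, n2, m) < a:
            lo = mid
        else:
            hi = mid
    return exp(0.5 * (lo + hi))

print(cmle(3, 2, 1, 4))   # close to the value 4.918 quoted for this table in Table 1
print(cmle(3, 2, 0, 5))   # None: undefined with a zero cell
```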

For completeness, we will mention some other odds ratio estimators 
which have been suggested, but which will not be considered further here, 
for a variety of reasons. The first, given in terms of $\beta$, is due to Birch (1964),
and is defined by

$$\hat{\beta}_B = \frac{(ad - bc)(n_1 + n_2)(n_1 + n_2 - 1)}{n_1 n_2 (a + c)(b + d)} .$$

The corresponding estimator of $\rho$ would be obtained simply by exponentiation.
Birch’s estimator was not considered in this paper because it is known
to be inconsistent (Hauck et al., 1982).

Next, Tukey (cited by Anscombe, 1956) and Berkson (1953) have pro¬ 
posed estimators for a single logit, and which are “corrected” so that they 
are defined when zeros occur. It would be possible to extend their ideas to 
cover the situation of a logit difference (or equivalently a log odds ratio), 
but this was not pursued, largely on the basis of a rather poor showing in a 
comparison of the single logit estimators with Haldane’s approach, by Gart 
and Zweifel (1967). 

Finally, Schaefer (1983) has recently suggested a small sample bias cor¬ 
rection for the estimated coefficients in a multiple logistic regression analysis. 
His corrected estimates are given by 

$$\hat{\beta}_S = \hat{\beta}_{ML} + \tfrac{1}{2}(X'VX)^{-1}X'V\{(1 - 2\pi_k)\,x_k'(X'VX)^{-1}x_k\},$$

where $X$ is the $(N \times p)$ design matrix, $V = \mathrm{diag}\{\pi_k(1 - \pi_k)\}$, $\{a_k\}$ denotes
a vector with elements $a_k$, and $\pi_i$ is the response probability for the
$i$th individual, being the usual multiple logistic function of the independent
variables.

For the situation with exactly one dichotomous independent regression
variable, the problem reduces to the estimation of a single log odds ratio; it 
may be shown (Walter, 1985) that the general form of Schaefer’s correction 
produces a corrected odds ratio estimate equal to 


/? = In 




b — a c — d 
ab cd 


This estimator was also not taken further in this paper, because it is unde¬ 
fined when a zero cell occurs. 


3. EXTENSIONS TO SETS OF 2 x 2 TABLES 

Several of these estimation methods have been extended to cover the 
case of a 2 x 2 x k contingency table. Each 2 x 2 table within the set of
k such tables is defined by a level of a covariate, which would usually be a 
confounding or nuisance variable. For example, in a study of the association 
of alcohol consumption and the occurrence of a certain cancer, the k strata 
might be formed by 5-year age groups of the study subjects. The objective 
is to compute an overall odds ratio for the combined data; the within-strata 
odds ratios are often regarded as constant, although tests for heterogeneity 
are possible (Mantel et al., 1977). A number of comparisons have been made
of the estimators for sets of 2 x 2 contingency tables. These have mostly been 
by computer simulation, using Monte Carlo methods. Although the present 
paper is restricted to a consideration of single 2x2 tables, the findings from 
investigations of multiple tables are obviously relevant. 

Early discussion of a number of MLE’s of the odds ratio is given by Gart, 
for both the unconditional sampling space (Gart, 1962) and for the condi¬ 
tional approach (Gart, 1970). Specific numerical examples are provided, but 
there is no systematic evaluation of the estimators over a variety of tables. 

A more extensive investigation using Monte Carlo methods was carried 
out by McKinlay (1975), who had the objective of comparing matched and 
unmatched sampling strategies. The estimators considered by her were: 
an unstratified cross-product estimate (given by applying equation 1 to the 
marginal 2x2 table obtained by collapsing across the k strata), Woolf’s esti¬ 
mate modified with 1/2 added to both the stratum-specific estimates and to 
their standard errors, the Mantel-Haenszel estimate (1959), Birch’s estimate, 
and an estimate based on the pair-matching of the data. It was found that 
the pair-matched estimate was one of the least precise, and demonstrated 
considerable bias. Woolf’s estimator was precise, but showed high bias, es¬ 
pecially when there were a large number of strata so that the individual 2x2 
tables were sparsely populated.

A second set of simulation results was described by McKinlay (1978).
Here various sets of 2 x 2 tables were defined, including some where in¬ 
teraction was present, i.e., where the odds ratio varied across strata. The 
same estimators as above were used, plus a modification of Birch’s estimate 
due to Goodman (1969), and an asymptotic conditional maximum likelihood 
estimate. In this case, however, the 1/2 correction to cell frequencies was 
not used, the stated reason being that “the overall frequencies were never 
zero”; although not explicitly stated, this implies a slightly different use of 
the Haldane correction, which would be applied only as necessary when zero 
cells occur, rather than all the time. The difference in the sampling distri¬ 
butions which is implied between correcting (i) as necessary, versus (ii) all 
the time, will tend to be less important in large samples (because of the 
relatively small effect of the e correction when large cell frequencies are in¬ 
volved), but this may not be so in small samples. This suggests that the 
properties of the two estimators, defined with these alternative corrections, 
should be compared. Incidentally, the strategy of correcting for zero cells 
only as necessary is probably used often in practice; it is likely that many 
data analyses proceed using the uncorrected estimators until the problem 
of a zero actually occurs, at which time one of the “corrected” methods is 
hurriedly brought into play! It is of course difficult to document that this 
has occurred in actual study reports, because only the final version of the 
analysis is given. 

Before examining her simulation results, McKinlay suggested
that “the standard error of the [conditional] estimate of the odds ratio 
should be smaller than its unconditional counterpart”. However, her results 
showed that the CMLE did not perform well in terms of bias or precision. 
The Mantel-Haenszel estimate (which is actually the first approximation in 
the iterative computation of the CMLE) did equally well, and this estimate 
was recommended by McKinlay overall, especially with large samples and 
when there are more than two strata. The Goodman and Birch estimates 
performed variably. Woolf’s estimate was the most precise, but was unstable
in its mean value, especially when there were many strata; McKinlay
recommended avoiding the Woolf method in this situation. 

Breslow (1981) examined a slightly different situation, again by simulation.
He supposed that the number of 2 x 2 tables increases as the sample size
increases; this might occur, for instance, with pair-matched data, where the 
pairs are regarded as being uniquely matched, such as when sibship matching 
is used. The estimators used here were the Woolf, the UMLE and CMLE, 
and the Mantel-Haenszel estimate. The Woolf estimate was adapted for ze¬ 
ros in a way that was slightly different again from the Haldane and McKinlay 
methods, in that 1/2 was added only as necessary, and only to those cells 
with zero entries. It was found that this strategy reduced the bias in the
Woolf estimate for larger sample sizes, but aggravated its instability in small
samples; overall Breslow did not recommend using the Woolf method. The 
best estimator, in his opinion, was the CMLE, but it was recognised that 
considerably more computation is required for this than the MH or UMLE 
methods. The MH estimate was always preferred to the UMLE; however the 
bias in the UMLE was somewhat “predictable”, and so the method might be
judiciously avoided for situations where one is alert to the possibility of bias 
and where such bias might be expected based on the simulation results. 

Finally Hauck et al. (1982) reported simulation results for sets of 2 x 2
tables, using the UMLE and MH estimate with $\varepsilon$ corrections of 1/4 and 1/2,
and two new “pseudo-count” estimators. In the last two, data-dependent 
corrections are made to the cell frequencies before computing the odds ratio 
with one of the usual methods; in the context of multiple tables, these esti¬ 
mators correspond to shrinking towards tables of constant odds ratio, or to 
tables of independence. The CMLE was not included. 

Yet another strategy was adopted by Hauck et al. to deal with zero cells 
which actually occurred; tables with any zero frequency cell were dropped 
altogether from the calculations for the UMLE when a correction was used. 
(For the UMLE with no $\varepsilon$ correction, tables with zeros do not contribute to
the overall likelihood.) The statistical properties of these corrected estima¬ 
tors were therefore calculated only over the (possibly) restricted set of tables 
where the estimator could be computed. 

A rather complicated set of recommendations emerged from Hauck’s 
study. The pseudo-count methods had small variances when the odds ra¬ 
tio was one, but elsewhere had much larger bias than the other estimators. 
Among the rest, the recommended choice of estimator depended on whether 
reducing bias or mean squared error is more important, and also on whether 
the odds ratio or its logarithm is the primary parameter to be estimated. 
The 1/2 and 1/4 corrected UMLE’s showed underestimation of both the 
odds ratio and its logarithm, especially when the association in the table 
was strong. The UMLE, modified by the addition of 1/4 to each cell, was
suggested for the estimation of the log odds ratio, partly because this estimator
showed less bias than the 1/2 corrected version. The Mantel-Haenszel
method modified with $\varepsilon = 1/4$ and the UMLE method with $\varepsilon = 1/2$ were
proposed for the odds ratio itself; the latter showed lower variance than the 
1/4 corrected estimate, but it also had large bias when the odds ratio was 
large.


4. REPORTING OF ZERO CELLS IN DATA

As noted above, a number of different strategies have been used by dif¬ 
ferent authors to deal with zero-celled tables when they arise. These may 
be summarised as follows: 


    Haldane      Add 1/2 always to cells of 2 x 2 table.

    Hitchcock    Add 1/4 always to cells of 2 x k table for logistic
                 regression, especially with k = 3.

    McKinlay     Add 1/2 as necessary, i.e., to all cells of tables
                 containing a zero frequency.

    Breslow      Add 1/2 only to zero cells as they occur.

    Hauck        Discard tables containing zeros from the analysis, when
                 using $\varepsilon$ corrections in 2 x 2 x k tables.

Hauck’s approach is simple and pragmatic for sets of 2 x 2 tables to be
combined, and will not lose very much of the information content in the data
if only occasional zeros occur in a small proportion of the tables. However,
it does not seem useful to extend this approach to situations where there is
only one 2 x 2 table. For instance, suppose the following table occurred in a
study to compare the frequency of adverse side effects in persons given one
of two alternative vaccines:


                     Side Effects
                     Yes       No

    Vaccine A         10      990
    Vaccine B          0     1000


The fact that a zero happened to occur in one of the cells would not be a
compelling reason to discard the data altogether. Such data could reasonably
be reported using Fisher’s exact test for the significance of the differences 
in the rate of side effects in the two groups, and exact one-sided confidence 
intervals for the odds ratio. 

The occurrence of a zero may imply a qualitative difference in the results, 
as compared to a table where the cell frequencies are all non-zero. Consider 
the alternative table:

                     Side Effects
                     Yes       No

    Vaccine A         10      990
    Vaccine B          1      999


In this second table, the existence of a patient with a side effect in group B 
shows that side effects are possible in this group, whereas the zero in the first 
table is much more problematic. This would be an important distinction in 
tables arising from diagnostic testing for the presence of biochemical abnor¬ 
malities or disease; certain laboratory tests are assumed by their nature to 
be incapable of giving a “false positive” result. For instance, a test for the 
presence of viral antibodies might be assumed, for pathognomonic reasons,
to give a positive result only when infection has taken place; the demonstration
of a positive result for some other reason would be a crucial finding. 

For the first table above, one could compute and quote one of the odds 
ratio estimates which is defined even if a zero is present, for example the 
modified Woolf estimate with $\varepsilon = 1/2$. However, it would be most unwise
for an investigator to report this figure without also reporting the fact that 
there were no side effects at all in the group with vaccine B; one might wish 
to compute the one-sided confidence interval for the rate of side effects in 
this group specifically, using the methods described by Hanley and Lippman- 
Hand (1983). 
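
For a group with no observed events, the exact one-sided upper limit follows from setting the binomial probability of zero events equal to one minus the confidence level. The Python sketch below computes this limit and compares it with the familiar "rule of three" approximation 3/n for 95% confidence; the function name is hypothetical and the group size of 1000 is taken from the vaccine example above.

```python
def zero_numerator_upper_limit(n, confidence=0.95):
    """Exact upper limit p solving (1 - p)**n = 1 - confidence (0 events in n)."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

n = 1000                                  # size of the vaccine B group
exact = zero_numerator_upper_limit(n)     # about 0.00299
approx = 3.0 / n                          # rule-of-three approximation, 0.003
print(exact, approx)
```

On this calculation, a true side effect rate of up to about 3 per 1000 remains compatible with having observed none in 1000 recipients.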

In the first table, the odds ratio estimate using Haldane’s 1/2 correction 
is 21.2; in the second table, using the same method, the estimate is 7.06. 
This indicates the instability of the results in such a small sample, and the 
investigator would do well to report the whole table. 
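
The two figures quoted above follow directly from equation (2) with a correction of 1/2 added to every cell; the arithmetic is written out here for the two vaccine tables:

```latex
\text{First table:}\quad
\frac{(10 + 0.5)(1000 + 0.5)}{(990 + 0.5)(0 + 0.5)} = \frac{10505.25}{495.25} \approx 21.2,
\qquad
\text{Second table:}\quad
\frac{(10 + 0.5)(999 + 0.5)}{(990 + 0.5)(1 + 0.5)} = \frac{10494.75}{1485.75} \approx 7.06 .
```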

What is of concern in the present paper is not so much how a table with 
a zero might be reported, but how to incorporate the tables with zero cells 
into the entire distribution of possible outcome tables. When the probabil¬ 
ity of a zero occurring is not small, it seems inappropriate to simply discard 
the tables with zeros and work within the restricted and conditional dis¬ 
tribution of the other tables. Therefore we will consider the behaviour of a 
number of estimators over the entire set of tables, including those with zeros. 
Recognising the likely strategy of many data analysts in practice, some of 
the estimators to be considered here will respond to zeros only as they arise; 
others will be computed in the same way for all tables, whether or not they 
contain zeros. 
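
The distinction between correcting always, correcting as necessary, and correcting only the zero cells can be made concrete with a short Python sketch. The function names are hypothetical labels for the strategies summarised earlier in this section, and the two vaccine tables are used as input.

```python
def cross_product(a, b, c, d):
    """Uncorrected cross-product ratio ad/(bc)."""
    return (a * d) / (b * c)

def add_always(table, eps=0.5):
    """Add eps to every cell of every table (Haldane with 1/2, Hitchcock with 1/4)."""
    a, b, c, d = (x + eps for x in table)
    return cross_product(a, b, c, d)

def add_as_necessary(table, eps=0.5):
    """Add eps to all four cells, but only when the table contains a zero."""
    if 0 in table:
        return add_always(table, eps)
    return cross_product(*table)

def add_to_zero_cells(table, eps=0.5):
    """Replace only the zero cells by eps, leaving the other cells unchanged."""
    a, b, c, d = (x if x > 0 else eps for x in table)
    return cross_product(a, b, c, d)

first = (10, 990, 0, 1000)     # vaccine table with a zero cell
second = (10, 990, 1, 999)     # vaccine table without a zero cell

for strategy in (add_always, add_as_necessary, add_to_zero_cells):
    print(strategy.__name__, round(strategy(first), 2), round(strategy(second), 2))
# add_always          21.21   7.06
# add_as_necessary    21.21  10.09
# add_to_zero_cells   20.2   10.09
```

The three strategies agree closely on the first table but differ on the second, where only the "always" rule modifies a table that has no zero.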

The issue of how to deal with zero cells has also been discussed in the 
context of constructing smoothed test statistics of hypotheses in log-linear 
models of larger contingency tables. The strategy of adding a positive constant
only to zero counts was proposed by Grizzle et al. (1969), but Cox
(1970) and Goodman (1970) have suggested always adding such a constant. 
In a Monte Carlo simulation, Bhapkar and Hosmane (1984) recommended 
the latter in the construction of an adjusted Wald statistic with optimal 
properties. Koopman (1984) has also recently proposed a 1/2 correction to 
deal with zeros in the estimation of the relative risk ($p_1/p_2$). Finally, Jewell
(1984) has compared the small sample properties of the MLE, jackknife and 
other estimators of the odds ratio from matched data. 


5. ESTIMATORS TO BE EVALUATED 

With the rationale above in mind, the following estimators were chosen 
for comparison: 

(i) $\hat{\rho}_1$, Woolf’s estimator, defined as in equation (2) with $\varepsilon = 1/2$, following
Haldane’s modification of the uncorrected UMLE;

(ii) $\hat{\rho}_2$, a variant of $\hat{\rho}_1$, with the 1/2 correction added only when zeros
occur in the table; specifically

$$\hat{\rho}_2 = \begin{cases} \hat{\rho}_1 & \text{if } abcd = 0 \\ \hat{\rho}_{ML} & \text{otherwise;} \end{cases}$$

(iii) $\hat{\rho}_3$, Hitchcock’s estimator, defined as in equation (2), with $\varepsilon = 1/4$;

(iv) $\hat{\rho}_4$, a modified Hitchcock estimator with a correction of 1/4 used
only when zero cells occur, and defined similarly to $\hat{\rho}_2$;

(v) $\hat{\rho}_5$, the CMLE defined in equation (3). As noted earlier, this estimator
is undefined if a zero cell occurs. Therefore the following ad hoc
modifications will be adopted, so that the estimator is defined in all tables.
First, if two zeros occur in the same column ($a + c = 0$ or $b + d = 0$) we define
$\hat{\rho}_5 = 1.0$. Second, if $a = 0$ or $d = 0$ only (i.e., there is a single zero cell in the
main diagonal), we define $\hat{\rho}_5$ as the solution to

$$a + 0.5 = E(a \mid n_1, n_2, a + c, b + d, \rho). \qquad (4)$$

Finally, if $b = 0$ or $c = 0$, we define $\hat{\rho}_5$ as the solution to

$$a - 0.5 = E(a \mid n_1, n_2, a + c, b + d, \rho). \qquad (5)$$

Lastly the two pseudocount estimators considered by Hauck et al. were
also evaluated. These are defined below for the 2 x 2 table situation.

(vi) The first pseudocount estimator is $\hat{\rho}_6 = a'd'/(b'c')$, where $a'$, $b'$, $c'$
and $d'$ are modified cell frequencies, defined as follows. For each cell, we
replace its entry ($r$, say) by

$$[r + f/4]/[1 + f/4],$$

where $f = [N^2 - \|X\|^2]/[\|X\|^2 - N^2/4]$. Here $\|X\|^2 = a^2 + b^2 + c^2 + d^2$
unless $a = b = c = d$, in which case $\|X\|^2 = N^2/4$ and $f = N^2 - \|X\|^2$. In
2 x 2 x k tables, the generalisation of this estimator corresponds to shrinking
the odds ratio estimates towards a constant value across the sub-tables.
It should be noted that the marginal totals are not preserved using this
estimator.

(vii) The second pseudo-count estimator is denoted by $\hat{\rho}_7$. In order to
compute it, we first calculate the modified frequency $a'$, defined as

$$a' = [a + f' n_1 (a + c)/N^2] / [1 + f'/N],$$

where

$$f' = \begin{cases} (N^2 - \|X\|^2)/[4(a - c)^2] & \text{if } a \neq c \\ (N^2 - \|X\|^2)/2 & \text{if } a = c. \end{cases}$$

Then the other modified frequencies are computed as $b' = n_1 - a'$,
$c' = a + c - a'$, and $d' = n_2 - c'$; thus the original margins are preserved. $\hat{\rho}_7$
is then computed as $a'd'/(b'c')$. In 2 x 2 x k tables, this estimator shrinks
the unadjusted empirical estimate back towards the null value of one, corresponding
to independence.

Both pseudo-count estimators ($\hat{\rho}_6$ and $\hat{\rho}_7$) are defined if there is exactly
one zero cell or a zero diagonal. If there is a zero column, $\hat{\rho}_6$ is defined, with
the value 1 if $n_1 = n_2$. The calculation of the pseudocounts for $\hat{\rho}_7$ with a
zero column leaves the table unchanged. As for $\hat{\rho}_5$, we again assigned the
value 1 for $\hat{\rho}_7$ if this occurred.
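
The two pseudo-count estimators can be sketched in Python as follows. The function names are hypothetical, and the denominator 4(a - c)^2 used for the second estimator when a and c differ is our reading of the definition in (vii); it was adopted because it gives values in agreement with those quoted for the sixth and seventh estimators in Tables 1 to 3.

```python
def norm_squared(a, b, c, d):
    """Sum of squared cell frequencies, ||X||^2."""
    return a * a + b * b + c * c + d * d

def rho6(a, b, c, d):
    """First pseudo-count estimator (vi): shrink every cell by the same rule."""
    n_total = a + b + c + d
    x2 = norm_squared(a, b, c, d)
    if a == b == c == d:
        f = n_total ** 2 - x2
    else:
        f = (n_total ** 2 - x2) / (x2 - n_total ** 2 / 4)
    k = f / 4
    # The common divisor (1 + f/4) cancels in the cross-product ratio.
    return ((a + k) * (d + k)) / ((b + k) * (c + k))

def rho7(a, b, c, d):
    """Second pseudo-count estimator (vii): margins are preserved."""
    n1, n2 = a + b, c + d
    n_total = n1 + n2
    if a + c == 0 or b + d == 0:
        return 1.0                        # zero column: value 1 assigned
    x2 = norm_squared(a, b, c, d)
    if a == c:
        f = (n_total ** 2 - x2) / 2
    else:
        f = (n_total ** 2 - x2) / (4 * (a - c) ** 2)
    a_new = (a + f * n1 * (a + c) / n_total ** 2) / (1 + f / n_total)
    b_new = n1 - a_new
    c_new = a + c - a_new
    d_new = n2 - c_new
    return (a_new * d_new) / (b_new * c_new)

# Cell counts of Tables 1, 2 and 3, in the order (a, b, c, d).
for table in [(3, 2, 1, 4), (35, 15, 25, 25), (5, 13, 1, 10)]:
    print(table, round(rho6(*table), 3), round(rho7(*table), 3))
```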


6 . EXAMPLES 

In the first example, shown in Table 1, the cell frequencies are small,
and we see considerable variation between the seven estimators. $\hat{\rho}_2$, which
in this case amounts to the uncorrected UMLE, is the largest, and $\hat{\rho}_6$ is the
smallest. The difference between the 1/2 correction, the 1/4 correction, and
no correction is quite marked.

In the second example, shown in Table 2, the total sample size has
increased from 10 to 100. The variation between estimates is now much
smaller, especially between the various corrected UMLE’s ($\hat{\rho}_1$ to $\hat{\rho}_4$). $\hat{\rho}_2$ is
again the largest and $\hat{\rho}_6$ the smallest.

The final example (see Table 3) is taken from a paper published in the
New England Journal of Medicine (Whitley et al., 1977), concerning the
association of survival with brain biopsy status in a series of herpes simplex
virus patients. Although the sample size is moderate in total, the presence
of a single small cell (the “1”) leads again to some disparity in the estimates.

Table 1. Example of Variation Between Estimates in a Small Sample

                # of Successes    # of Failures    Sample Size

    Sample 1          3                 2               5
    Sample 2          1                 4               5

    $\hat{\rho}_1$ = 4.200; $\hat{\rho}_2$ = 6.000; $\hat{\rho}_3$ = 4.911; $\hat{\rho}_4$ = 6.000;

    $\hat{\rho}_5$ = 4.918; $\hat{\rho}_6$ = 1.970; $\hat{\rho}_7$ = 3.315.

Table 2. Example of Variation Between Estimates in a Moderate Sample

                # of Successes    # of Failures    Sample Size

    Sample 1         35                15              50
    Sample 2         25                25              50

    $\hat{\rho}_1$ = 2.290; $\hat{\rho}_2$ = 2.333; $\hat{\rho}_3$ = 2.312; $\hat{\rho}_4$ = 2.333;

    $\hat{\rho}_5$ = 2.313; $\hat{\rho}_6$ = 1.829; $\hat{\rho}_7$ = 2.040.


Depending on the estimate chosen, we would conclude that there is anything
from a doubling to almost a quadrupling in the odds of death for biopsy
positive patients as compared to biopsy negative patients.

7. NUMERICAL EVALUATION OF ESTIMATORS 

The numerical comparison of the estimates was made over a series of
2 x 2 contingency tables, where the sample sizes $n_1$ and $n_2$ were fixed, and
the outcome probabilities $p_1$ and $p_2$ were taken as known. Sets of tables
with a range of equal binomial sample sizes were used, with $n_1 = n_2 = n = 5$,
10, 25 and 50. A number of other enumerations were also made with
unequal binomial sample sizes, but the results were not materially different.




S. D. WALTER 


Table 3. Mortality in Herpes Simplex Patients 
Treated with Adenine Arabinoside, by Brain Biopsy Status. 


Dead Alive Total 


Biopsy Positive 5 13 18 

Biopsy Negative 1 10 11 


$\hat{\rho}_1$ = 2.852; $\hat{\rho}_2$ = 3.846; $\hat{\rho}_3$ = 3.429; $\hat{\rho}_4$ = 3.846;

$\hat{\rho}_5$ = 3.692; $\hat{\rho}_6$ = 2.012; $\hat{\rho}_7$ = 2.668.


Second, the outcome probabilities were defined as 0.1, 0.3, 0.5, and 0.7,
with $p_1$ being taken as the greater, so that tables with “positive” associations
were included. These values yield odds ratios ranging from 1 to 21, thus
covering most practical situations. Note that the results for the log odds
ratio in tables with higher values of $p_1$ and $p_2$ may be obtained by symmetry.
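
The quoted range of odds ratios can be checked by listing the configurations. The short Python sketch below assumes that configurations with equal probabilities are included, which is what gives the lower limit of 1; the variable names are illustrative.

```python
from itertools import product

def odds(p):
    """Odds p/(1 - p) corresponding to a probability p."""
    return p / (1 - p)

probabilities = [0.1, 0.3, 0.5, 0.7]
# Keep configurations with p1 at least as large as p2.
grid = {
    (p1, p2): odds(p1) / odds(p2)
    for p1, p2 in product(probabilities, repeat=2)
    if p1 >= p2
}

smallest = min(grid.values())   # 1, attained whenever p1 = p2
largest = max(grid.values())    # 21, attained at p1 = 0.7, p2 = 0.1
print(len(grid), smallest, round(largest, 6))
```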

For each estimator, the mean, variance, bias, mean squared error (MSE),
and average absolute error (AAE) were computed over the set of all possible
tables with the given margins $n_1$ and $n_2$. The value for each table-specific
estimate was weighted by the (unconditional) probability of the table’s occurrence.
As noted earlier, this evaluation was restricted to estimators which
are defined for all tables, so tables with zero cells or columns were included
and weighted appropriately. It should be emphasised that this represents a
complete numerical evaluation of the sampling distribution of each estimator,
rather than a simulation experiment.
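
The enumeration described above can be sketched in a few lines of Python. This is an illustration of the procedure, not the FORTRAN program used for the paper: the function names are hypothetical, only the estimator of equation (2) with a correction of 1/2 is shown, and the evaluation is on the odds ratio scale (wrapping the estimator in a logarithm would give the corresponding results for the log odds ratio).

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials with probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def woolf_haldane(a, b, c, d, eps=0.5):
    """Equation (2) with eps added to every cell, so defined for all tables."""
    return ((a + eps) * (d + eps)) / ((b + eps) * (c + eps))

def enumerate_estimator(estimator, n1, n2, p1, p2):
    """Exact moments of an estimator over all (n1 + 1)(n2 + 1) possible tables."""
    true_value = (p1 / (1 - p1)) / (p2 / (1 - p2))
    total_prob = mean = second_moment = aae = 0.0
    for a in range(n1 + 1):
        for c in range(n2 + 1):
            # Unconditional probability of the table: two independent binomials.
            prob = binomial_pmf(a, n1, p1) * binomial_pmf(c, n2, p2)
            estimate = estimator(a, n1 - a, c, n2 - c)
            total_prob += prob
            mean += prob * estimate
            second_moment += prob * estimate ** 2
            aae += prob * abs(estimate - true_value)
    variance = second_moment - mean ** 2
    bias = mean - true_value
    return {
        "total_prob": total_prob,          # should equal 1
        "mean": mean,
        "variance": variance,
        "bias": bias,
        "mse": variance + bias ** 2,
        "aae": aae,
    }

summary = enumerate_estimator(woolf_haldane, 5, 5, 0.5, 0.3)
print(summary)
```

Because every possible table is visited and weighted by its exact probability, the results carry no Monte Carlo error.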

All calculations were made using a FORTRAN program, with the Mi¬ 
crosoft compiler (version 3.2), running under MS-DOS 2.10 and using double 
precision throughout. The program was executed on an IBM-XT microcom¬ 
puter, fitted with an 8087 math co-processor. 

8. RESULTS 

Table 4 shows the results for all the sample sizes and values of $p_1$ and $p_2$
considered. We will discuss some typical results, first for $n_1 = n_2 = 50$. Among
the modified MLE’s for $\rho$ ($\hat{\rho}_1$ to $\hat{\rho}_5$), the Woolf estimator is uniformly best
on bias, MSE and AAE; also the log equivalent, $\hat{\beta}_1$, is the best of this group.
In particular, the CMLE’s appear to be inferior to the Woolf estimator everywhere,
although some of the differences in performance are small numerically.

For the odds ratio itself, the pseudo-count estimators $\hat{\rho}_6$ and $\hat{\rho}_7$ are
superior to $\hat{\rho}_1$ for a good many problems. However, $\hat{\rho}_7$ tends to have very
high MSE on occasion, particularly when the contingency table shows a
strong association. (See, for example, the configuration $p_1 = 0.7$, $p_2 = 0.1$.)

We will now turn to the results for $n_1 = n_2 = 10$, considering first the
estimation of the log odds ratio. As before, fa is almost always the least 
biased estimate; fa did perform slightly better for a few problems where p 2 
was small. Of the first 5 estimators, fa also performed well on MSE and 
AAE, although the CMLE fa did slightly better in a number of problems 
with low values of p 2 . However, fa had the lowest MSE and AAE overall in 
most problems with moderate or high p 2 . 

For the odds ratio, fa performed well overall, and either it or p 5 were 
the least biased for all problems. The Hitchcock estimators $\hat{\rho}_3$ and $\hat{\rho}_4$ did
poorly, with high bias and/or MSE. 

Comparing the Woolf and CML estimators, the latter [$\hat{\rho}_5$] was preferred
in all respects when $p_2$ was small; otherwise the results were mixed. $\hat{\rho}_7$
again showed very high MSE for some problems, and occasional high bias, 
especially when the odds ratio was large. 

Some but not all of these trends were repeated in the results for the 
smallest samples considered, with $n_1 = n_2 = 5$. Comparing the unconditional
and conditional MLE’s (estimators 1 and 5), the Woolf estimate was generally
better on bias, and it also performed quite well on MSE and AAE, although 
there was no clear “winner” in this respect. 

Indeed for some problems, the ranking of several of the estimators on the 
basis of the MSE criterion was the reverse of the ranking for bias. Consider, 
for instance, the results when $p_1 = 0.5$ and $p_2 = 0.3$, for estimators 1, 5, 6
and 7. On the criterion of least bias, the preference would be for
estimator 7, followed by 1, 5 and 6. However, on the basis of MSE, the rank 
order would be 6, 5, 1, and 7, exactly the reverse! 
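
Such a reversal is possible because the mean squared error combines the bias with the variance. Writing $\hat{\theta}$ for any of the estimators and $\theta$ for the true value (notation introduced here for illustration), the standard decomposition is:

```latex
\mathrm{MSE}(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big]
= \mathrm{Var}(\hat{\theta}) + \big[E(\hat{\theta}) - \theta\big]^2 .
```

An estimator with the smallest bias can therefore have the largest MSE when its variance is large enough to dominate the squared bias term.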

As in earlier results, the Hitchcock estimators (3 and 4) appeared to be
inferior in most cases to the Woolf estimators (1 and 2). For the odds ratio,
$\hat{\rho}_3$ and $\hat{\rho}_4$ again performed poorly and are not recommended.

The CMLE fa did well, with the lowest MSE everywhere; also it often 
had the lowest AAE, with fa sometimes being better. The UMLE fa was 
less biased than fa for a number of problems with p 2 = 0.1. [This was the 
reverse of the finding in larger samples, where the UML approach was better 
with larger values of p 2 .] 



Table 4. Results of Complete Enumeration of Alternative Odds Ratio Estimators
in a Series of Fourfold Tables

[Table panels, by total sample size: 100 (n1 = n2 = 50); 50 (n1 = n2 = 25); 20 (n1 = n2 = 10). Tabulated values not recovered.]


94 


S. D. WALTER 










[Table panels, continued: total sample size 20 (n1 = n2 = 10); 10 (n1 = n2 = 5). Tabulated values not recovered.]
9. CONCLUSIONS

A complicated pattern of results has emerged, and there does not appear
to be a simple answer to the question of which estimator to use in all
problems. The optimal choice will depend on whether the odds ratio or its
logarithm is of greater interest, and whether the estimator is to be judged on
bias, precision, or mean squared error. The choice will also depend on the
sample sizes involved, and the underlying binomial outcome probabilities;
the latter suggests that prior estimates of $p_1$ and $p_2$ would be very useful in
making the choice, if such estimates are available.

Some broad recommendations may be made, although it is clear that 
some of them cannot be made with great conviction, as the difference in 
performance between the various estimators is sometimes small, and their 
relative ranking may change rapidly as the problem specifications are altered. 

1. Contrary to some suggestions in the literature, the conditional ML
approach does not consistently outperform the unconditional ML method.
In fact the CMLE was usually inferior to the Woolf modification of the
UMLE, i.e., $\hat\beta_1$. The main exception to this was for small $n$ (say $n < 10$)
with small $p_2$, where the CMLE had lower MSE and AAE. This intriguing
finding is at variance with Hauck's (1984) recent findings in simulations over
sets of $2 \times 2 \times k$ tables, with $k = 5$ or 10.

2. Among the set of modified UMLE's for $\beta$ [$\hat\beta_1$ to $\hat\beta_4$] it appeared that
$\hat\beta_1$ was best. Thus 1/2 should be added to the four contingency cells in all
cases, and this will give a better estimator than with the strategy of adding
1/2 only as necessary when zeros arise, or of adding 1/4 in either way.

3. For the estimators of the odds ratio itself, the results are less clear
cut. However, the UMLE was much better than the CMLE for large
n, whereas the CMLE tended to be better (especially with respect to MSE 
and AAE) for smaller samples. Given that the CMLE requires considerably 
more computation time, it cannot be recommended for routine use. How¬ 
ever in some problems, the gain in precision from using the CMLE can be 
substantial, in comparison to the UMLE. The difficulty here is in obtaining 
prior estimates of pi and p 2 so that those particular problems, where a gain 
in precision might be expected, may be accurately identified. 

4 . fa and p 6 appear to be promising estimators worthy of further con¬ 
sideration, frequently having the lowest MSE and AAE values; /3$ was also 
competitive with respect to bias in a number of problems. However, we 
should bear in mind here Hauck’s finding that both pseudo-count estimators 
generally had large bias when applied to sets of 2 X 2 tables. 

5. Several estimators may reasonably be ruled out of consideration. 
The Hitchcock estimators p$ and p 4 are nearly always inferior to the Woolf 
estimator pi, which outperforms p 2 for odds ratio estimation, as well as for 
log odds ratio estimation. fa should be ruled out because of its very unstable
MSE, especially in small samples, and fa also performed indifferently. 

In interpreting the numerical results, we should recall that the influence 
of tables with zero cells becomes greater as their probability of occurrence 
increases, i.e., in small samples. It is quite possible that the performance 
of some of the estimators has been substantially affected by the method 
selected to deal with zero cells, for example the solution of a substitute 
equation (equations (4) or (5)) for the CMLE, or the assignment of the value 
1 for fa and fa when a zero column occurred. Although these adaptations 
are ad hoc and lack any theoretical foundation, it is evident that some kind 
of adaptation is necessary for the estimator to be defined in all problems 
which might arise. Further research is needed on strategies for dealing with 
zero cells, and this may clarify the optimal choice of estimation technique, 
especially in very small samples where prior estimates of the underlying 
binomial probabilities are not available. 
DEVELOPMENT OF FORMULAS FOR THE BIAS AND
MEAN SQUARE ERROR OF THE LOGIT ESTIMATOR 

ABSTRACT 

Formulas for calculating the bias and mean square error of the logit
estimator are developed. These formulations involve asymptotic series which
should provide adequate approximations provided the parameters $n$ and $P$ of
the binomial model are such that $nP$ is not small. Although the asymptotic
expansions are algebraically complicated, they have been mathematically
expressed in such a way that general coefficients can be isolated and
calculated.


1. INTRODUCTION

The logit $\lambda$ is defined as the logarithm of the ratio of the probability of
the occurrence of an event to the probability of its nonoccurrence. That is,
$\lambda = \ln[P/(1 - P)]$, where $X$ has a binomial distribution with parameters $n$
and $P$. An estimator of $\lambda$ is $\hat\lambda = \ln[X/(n - X)]$. There is a positive
probability that $n - X$ can be zero and, thus, the moments of the distribution
of $\hat\lambda$ are infinite. Provided $0 < P < 1$ and the sample size is large enough,
the observed frequency will be positive and the logit estimate can be
calculated. When the observed frequency is zero, suggested remedies range
from smoothing the observed frequencies to the more general practice of
adding a simple constant, commonly 1/2, to each observed frequency.
Gart and Zweifel (1967) have examined the small


1 Bureau of Communicable Disease Epidemiology, Laboratory Centre for Dis¬ 
ease Control, Tunney’s Pasture, Ottawa, Ontario K1A 0L2 

2 Department of Epidemiology and Biostatistics, The University of Western On¬ 
tario, London, Ontario N6A 5C1 

103 

I. B. MacNeill and G. J. Umphrey (eds.), Biostatistics, 103-124. 

© 1987 by D. Reidel Publishing Company. 





G. A. WELLS AND A. DONNER 


sample bias of several of these logit estimators. In particular, the estimator

$$\hat\lambda_{HA} = \ln\left(\frac{X + \tfrac12}{n - X + \tfrac12}\right)$$

is an unbiased estimator of $\lambda$ to order $n^{-1}$ (that is, except for terms of
$O(n^{-2})$). Haldane (1955) was the first to suggest this estimator. Commenting
on Woolf's estimator (1955) of log odds, he used logarithmic expansions
of the bias to arrive at this estimator (Anscombe (1956) independently
suggested the same estimator in his examination of the parameters of the logistic
function using Berkson's minimum logit $\chi^2$ method). Unpublished results
of Goodman, given in a paper by Gart et al. (1985), use this approach.
As noted by Naylor (1967), Haldane's paper is “hard to follow and may be
incomplete”. A more systematic and complete development of this approach
is required.
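The claim that adding 1/2 removes the bias to order $n^{-1}$ can be checked numerically. The following is a minimal sketch, not part of the original paper: it enumerates the binomial distribution exactly and evaluates the bias of $\ln[(X+t)/(n-X+t)]$ for a few constants $t$. The function name `exact_bias` and the chosen values of $n$, $P$ and $t$ are illustrative assumptions.

```python
from math import comb, log

def exact_bias(n, p, t):
    """Exact bias of ln((X+t)/(n-X+t)) for X ~ Binomial(n, p), by enumeration.

    Requires t > 0 so that the estimator is finite at X = 0 and X = n.
    """
    q = 1.0 - p
    expected = 0.0
    for x in range(n + 1):
        prob = comb(n, x) * p**x * q**(n - x)
        expected += prob * log((x + t) / (n - x + t))
    return expected - log(p / q)

if __name__ == "__main__":
    # Illustrative settings (assumptions, not taken from the paper).
    for t in (0.25, 0.5, 1.0):
        print(f"t = {t:4.2f}  bias = {exact_bias(50, 0.3, t): .3e}")
```

With these settings the bias at $t = 1/2$ should be much smaller in magnitude than at the other two constants, in line with the order-$n^{-1}$ cancellation described above.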

For a binomial distribution with parameters $n$ and $P$, methods and
formulas for calculating the bias and mean square error of the logit estimator
are developed. These formulations involve asymptotic series in powers of
$n^{-1}$ which, for moderately large $n$ and for values of $P$ not too extreme,
provide adequate approximations to these quantities. Although the asymptotic
expansions are algebraically complicated, they have been mathematically
expressed in such a way that general coefficients can be isolated and
calculated. These coefficients can then be implemented for any choice of the
parameters $n$ and $P$. Computer programs are available to calculate these
various coefficients up to and including terms of any desired order of $n^{-1}$.


2. METHOD

Consider a discrete random variable $X$ that has a binomial distribution
with parameters $n$ and $P$. Let $Q = 1 - P$. The logit is $\lambda = \ln[P/(1 - P)]$
and an estimator of it is given by

$$\hat\lambda(t) = \ln\left(\frac{X + t}{n - X + t}\right),$$

where $t$ is some constant to be established.

Formulations of the bias and mean square error of the logit estimator are 
developed using an approach that is a generalization of the one considered by 
Haldane (1955), involving the logarithmic expansion of the logit estimator. 
Two results that are required in this development are given in the following 
two lemmas. The first result is a closed form expression for the central 
moments of the binomial distribution. Using the principle of mathematical
induction, the following lemma is established. 

Lemma 1. Consider a random variable $X$ distributed as a binomial with
parameters $n$ and $P$. The central moments $\mu_r = E(\gamma^r)$, where $\gamma = X - nP$,
can be expressed as:

$$\mu_r = (-1)^r \sum_{h=0}^{[r/2]} n^h \sum_{\ell=1}^{r-2h+1} P^{\,r-h+1-\ell}\, Q^{\,h-1+\ell}\, (-1)^{\ell-1}\, C_{h,\,r-2h+1,\,\ell}$$

W«] [*=*“] 

= -E n ‘W)‘ E 

fc=0 £=1 

/qt— 2h —2£-f 2 __j_ /_ll**P**—^—2£-}-2\ 


(i) 


(2) 


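where the constants

$$C_{h,m,\ell} = (2h + m - 2)\,C_{h-1,m,\ell} + (h + m - \ell)\,C_{h,m-1,\ell-1} + (h + \ell - 1)\,C_{h,m-1,\ell}$$

subject to

$$\begin{cases}
C_{h,m,0} = 0 & \text{for } h, m = 0, 1, 2, \ldots \\
C_{0,m,\ell} = 0 & \text{for } m, \ell = 0, 1, 2, \ldots \ (\text{except } m = \ell = 1) \\
C_{0,1,1} = 1.
\end{cases}$$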
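Lemma 1 can be checked against central moments computed by direct enumeration. The sketch below is illustrative rather than the authors' program. It assumes the three-term recursion $C_{h,m,\ell} = (2h+m-2)C_{h-1,m,\ell} + (h+m-\ell)C_{h,m-1,\ell-1} + (h+\ell-1)C_{h,m-1,\ell}$, and it further assumes that $C_{h,m,\ell}$ is zero whenever $\ell$ falls outside $1 \le \ell \le m$; the function names are hypothetical.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def C(h, m, l):
    """Constants of Lemma 1 (assumed zero outside 1 <= l <= m)."""
    if h < 0 or m < 1 or l < 1 or l > m:
        return 0
    if h == 0:
        return 1 if (m == 1 and l == 1) else 0
    return ((2 * h + m - 2) * C(h - 1, m, l)
            + (h + m - l) * C(h, m - 1, l - 1)
            + (h + l - 1) * C(h, m - 1, l))

def central_moment_lemma(r, n, p):
    """mu_r from the double sum in equation (1) of Lemma 1."""
    q = 1.0 - p
    total = 0.0
    for h in range(r // 2 + 1):
        m = r - 2 * h + 1
        for l in range(1, m + 1):
            total += (n**h * p**(r - h + 1 - l) * q**(h - 1 + l)
                      * (-1)**(l - 1) * C(h, m, l))
    return (-1)**r * total

def central_moment_exact(r, n, p):
    """mu_r by summing (x - np)^r over the binomial probabilities."""
    q = 1.0 - p
    return sum(comb(n, x) * p**x * q**(n - x) * (x - n * p)**r
               for x in range(n + 1))

if __name__ == "__main__":
    for r in range(6):
        print(r, central_moment_lemma(r, 12, 0.3), central_moment_exact(r, 12, 0.3))
```

For $h = 1$ the constants reduce to the familiar rows 1; 1, 1; 1, 4, 1; 1, 11, 11, 1, which is one way to see why the second and third terms of the recursion carry the weights $(h+m-\ell)$ and $(h+\ell-1)$.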


A second fundamental result that will be required involves certain basic, 
but convenient, relationships involving the parameter P. These relationships 
are given in the following lemma. 

Lemma 2. If $P$ and $Q$ are such that $0 < P < 1$ and $Q = 1 - P$, then for
any positive integer $\ell$,

$$Q^{\ell} - (-1)^{\kappa} P^{\ell} = (Q - P)^{1-\kappa} \sum_{j=1}^{[(\ell+1+\kappa)/2]} d_{\ell,j,\kappa}\,(-1)^{j-1}(PQ)^{j-1} \tag{3}$$

where $\kappa$ is a constant that can be set to 0 or 1 and
Some common functions used in the various infinite series are:

JO if * is an even integer or zero 
J1 otherwise, 

largest integer < t‘/2, 

f0 if a < 0 
( 1 if a > 0, 

11 if i ^ H 2 
\ 0 if i — i/2, 
f 1 if t = 0 
1 1 if i > 0. 

Also the combination formula (£) is defined to be zero if a is negative, b is 
negative, or b > a. 

For the estimator $\hat\lambda(t)$ of the logit $\lambda$, both the bias and mean square
error of this estimator can be expressed as an asymptotic series of powers of
$n^{-1}$. More specifically, Corollaries 1 and 2 establish, respectively, that the
bias can be expressed as


Ki = 

[•72] = 

6a = 

i = 


$$b(t) = (Q - P) \sum_{w=1}^{\infty} \sum_{s=0}^{w-1} \sum_{g=2s-w+1}^{w} A_{w,s,g}\, t^{\,w-g}\, (PQ)^{s}\, (nPQ)^{-w} \tag{4}$$

and the mean square error can be written as

$$m(t) = \sum_{w=1}^{\infty} \sum_{s=0}^{w-1} \sum_{g=2s-w}^{w} F_{w,s,g}\, t^{\,w-g}\, (PQ)^{s}\, (nPQ)^{-w}. \tag{5}$$

In these expressions, $A_{w,s,g}$ and $F_{w,s,g}$ are constants with respect to $n$, $P$
and $t$.

To derive these expressions, the logarithmic functions

$$\ln\left(1 + \frac{t + \gamma}{nP}\right) \quad \text{and} \quad \ln\left(1 + \frac{t - \gamma}{nQ}\right)$$

must be considered. For both of these functions, the function and its
derivatives of all orders exist at $(t + \gamma)/(nP) = 0$ and $(t - \gamma)/(nQ) = 0$ respectively.
A Maclaurin series can be considered to expand each function as a power
series in its argument. This expansion of a function represents that function
for only those values that are in the interval of convergence of the series. For
such functions this interval is $(-1, 1]$.
The actual domains of possible values of $(t + \gamma)/(nP)$ and $(t - \gamma)/(nQ)$,
under the assumption that $t$ is positive, each have a lower limit greater than
$-1$, but the upper limits may be greater than 1. In regard to the latter, if
$0 < t < nP$ and $0 < t < nQ$, then both upper limits approach values less
than or equal to 1 only for large $n$ and moderate $P$. Thus, except for these
special cases, there is a failure to include the entire range of values in the
interval of convergence.
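The extent of the problem can be illustrated numerically. The sketch below is an illustration under stated assumptions, not part of the original development: it uses $\gamma = X - nP$ with $X$ ranging over $0, \ldots, n$ to compute the extreme values of the two arguments, and it also reports the binomial probability that both arguments fall inside $(-1, 1]$. The function names and the example values of $n$, $P$ and $t$ are hypothetical.

```python
from math import comb

def argument_ranges(n, p, t):
    """Extreme values of (t+gamma)/(nP) and (t-gamma)/(nQ), gamma = X - nP."""
    q = 1.0 - p
    # gamma runs from -nP (X = 0) to nQ (X = n).
    first = ((t - n * p) / (n * p), (t + n * q) / (n * p))
    second = ((t - n * q) / (n * q), (t + n * p) / (n * q))
    return first, second

def mass_inside_interval(n, p, t):
    """P(both arguments lie in (-1, 1]) under X ~ Binomial(n, p)."""
    q = 1.0 - p
    total = 0.0
    for x in range(n + 1):
        gamma = x - n * p
        a = (t + gamma) / (n * p)
        b = (t - gamma) / (n * q)
        if -1.0 < a <= 1.0 and -1.0 < b <= 1.0:
            total += comb(n, x) * p**x * q**(n - x)
    return total

if __name__ == "__main__":
    print(argument_ranges(50, 0.3, 0.5))
    print(mass_inside_interval(50, 0.3, 0.5))
```

In this example the upper limit of the first argument exceeds 1, yet almost all of the probability mass lies inside the interval of convergence, which is the situation in which an asymptotic series can still be useful.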

Still a valid asymptotic series of powers of $n^{-1}$ can be derived for $b(t)$
and $m(t)$. This series may not converge, but has the asymptotic property
that it remains bounded as $w$ increases. As noted by Hotelling (1953), when
confronted with a similar problem in the study of the correlation
coefficient, “Though presumably divergent, it will nevertheless provide through
its first terms a good approximation to the true moment if the sample is
large enough”.

In Theorem 1 the logarithmic functions are expanded and asymptotic
series of powers of $n^{-1}$ are derived for the bias and mean square error.

Theorem 1. Let $X$ be a random variable distributed as a binomial with
parameters $n$ and $P$. Consider the estimator

$$\hat\lambda(t) = \ln\left(\frac{X + t}{n - X + t}\right),$$

where $t$ is constant, of $\ln(P/Q)$. The bias and mean square error of this
estimator can be expressed as an asymptotic series of powers of $n^{-1}$. That
is, respectively,

$$b(t) = \sum_{w=1}^{\infty} \sum_{h=0}^{w} \sum_{j=2h}^{h+w} \sum_{\ell=1}^{j-2h+1} \binom{w+h}{j} (-1)^{w+h+j+\ell} (w+h)^{-1}\, C_{h,\,j-2h+1,\,\ell} \left(Q^{w+h} + (-1)^{j+1} P^{w+h}\right) P^{\,j-2h+1-\ell-w}\, Q^{\,\ell-1-w}\, t^{\,w+h-j}\, n^{-w} \tag{6}$$


and 


00 uf h-\- to [j/2] w + /i—j+» j —2h+l 


2 t ti ( v \ ( w + h ~ 



b h - t>\ 

i - i ) 


">« = E E E E E E 

tu = l h=0 j=2h i=0 v=i t= 1 

(-pw+K+J+l- l( y w + h - ( Q V + (-1)‘ +1 P U ) 

^QW + h—v ^y-i+1 pw + h-v^ pj—2h+l — L—v) 
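Proof. To derive $b(t)$, consider the difference

$$\begin{aligned}
D(t) &= \ln\left(\frac{X + t}{n - X + t}\right) - \ln\left(\frac{P}{Q}\right) \\
&= \ln\left(\frac{nP + \gamma + t}{nP}\right) - \ln\left(\frac{nQ - \gamma + t}{nQ}\right) \\
&= \ln\left(1 + \frac{t + \gamma}{nP}\right) - \ln\left(1 + \frac{t - \gamma}{nQ}\right).
\end{aligned}$$

Using the asymptotic series expansion

$$\sum_{i=1}^{\infty} (-1)^{i+1} \frac{x^i}{i}$$

of the logarithmic function $\ln(1 + x)$, $D(t)$ can be expressed as

$$\begin{aligned}
D(t) &= \sum_{i=1}^{\infty} (-1)^{i+1} (i n^i P^i)^{-1} (t + \gamma)^i - \sum_{i=1}^{\infty} (-1)^{i+1} (i n^i Q^i)^{-1} (t - \gamma)^i \\
&= \sum_{i=1}^{\infty} (-1)^{i+1} (i n^i P^i)^{-1} \sum_{j=0}^{i} \binom{i}{j} t^{\,i-j} \gamma^j - \sum_{i=1}^{\infty} (-1)^{i+1} (i n^i Q^i)^{-1} \sum_{j=0}^{i} \binom{i}{j} t^{\,i-j} (-\gamma)^j \\
&= \sum_{i=1}^{\infty} \sum_{j=0}^{i} \binom{i}{j} (-1)^{i+1} i^{-1} (nPQ)^{-i} \left(Q^i + (-1)^{j+1} P^i\right) t^{\,i-j} \gamma^j \\
&= \sum_{j=0}^{\infty} \left[ \sum_{i=j}^{\infty} \binom{i}{j} (-1)^{i+1} i^{-1} (nPQ)^{-i} \left(Q^i + (-1)^{j+1} P^i\right) t^{\,i-j} \right] \gamma^j, \tag{8}
\end{aligned}$$

where in the last step, when the summations involving $i$ and $j$ are
interchanged, a zero term involving $i = 0$, $j = 0$ has been added.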

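Taking the expectation of $D(t)$, using equation (1) of Lemma 1, yields

$$b(t) = \sum_{j=0}^{\infty} \sum_{i=j}^{\infty} \sum_{h=0}^{[j/2]} \sum_{\ell=1}^{j-2h+1} \binom{i}{j} (-1)^{i+j+\ell}\, i^{-1}\, C_{h,\,j-2h+1,\,\ell} \left(Q^i + (-1)^{j+1} P^i\right) P^{\,j-h+1-\ell-i}\, Q^{\,h-1+\ell-i}\, t^{\,i-j}\, n^{h-i}. \tag{9}$$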
Setting $w = i - h$, (9) can be expressed as

$$\begin{aligned}
b(t) &= \sum_{j=0}^{\infty} \sum_{h=0}^{[j/2]} \sum_{w=j-h}^{\infty} \sum_{\ell=1}^{j-2h+1} \binom{w+h}{j} (-1)^{w+h+j+\ell} (w+h)^{-1}\, C_{h,\,j-2h+1,\,\ell} \left(Q^{w+h} + (-1)^{j+1} P^{w+h}\right) P^{\,j-2h+1-\ell-w}\, Q^{\,\ell-1-w}\, t^{\,w+h-j}\, n^{-w} \\
&= \sum_{w=0}^{\infty} \sum_{h=0}^{w} \sum_{j=2h}^{h+w} \sum_{\ell=1}^{j-2h+1} \binom{w+h}{j} (-1)^{w+h+j+\ell} (w+h)^{-1}\, C_{h,\,j-2h+1,\,\ell} \left(Q^{w+h} + (-1)^{j+1} P^{w+h}\right) P^{\,j-2h+1-\ell-w}\, Q^{\,\ell-1-w}\, t^{\,w+h-j}\, n^{-w}. \tag{10}
\end{aligned}$$

If $w = 0$, then $h = 0$ and $j = 0$, and in this case $\left(Q^{w+h} + (-1)^{j+1} P^{w+h}\right) = 0$.
Removing the zero terms involving $w = 0$ from equation (10) transforms
this equation into the form of equation (6).

To derive m(t), consider D 2 (t). Using equation (8), 


{ OO r OO / . x 

£ £ ( 2 )(- 1 )’ ,+ 1 ir 1 (^)~ ii 

Jl = V1/ 

(<?** + (-l) 3l+1 P il ) 7 J1 J 

{£[£ (‘')(- 1 )* 3 + 1 i 2' 1 ( p< ?) _<2 

l A=ok=A W 

(<?’ 2 + (-1) 32+1 P <2 ) 7 32 J 

QQ f [j/ 2 ] r oo /. v 

= £{£ 2% £ {■) (-i ) ii+1 n l {PQ)- ii 

i=0 V *=0 S 1= * ' ' 

(Q il + (-l) i+1 P il )t il ~ i n- ii 

£ (,-!! l -)(- l ) <a+ 1 i*" l W)" <a 

L i 2 =j-t ' 

(Q‘ 2 + (-l) 3 -’ +1 P‘*) +‘ n _ * 2 | 7 J 
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°° U/ “ l ( OO rtt — j + 1 r / s 

= EE 2 ‘"lE E ’ (-rvrn- 

3—0 t—0 ^ u=j v—i ' ' 

{Q v + i-l) i+1 P v ) t v ~* (j~”'j(-l) u - v+1 (u-y) - 1 

(PQ )-<—*> (Q u ~ v + t *-v-j+i n -u jy 

OO oo [j/2 ]u-y+tr / \ / \ 

= EEE E J *'C)(r;)(-»-ba= 2 )- , («J)- 

j=0u=j i=0 v=i L ' ' ' 


{Q v + (-1 ) i+1 P v ) (Q u ~ v + t «-i 


n U 7 J . (11) 


The series for the mean square error is obtained by taking the expecta¬ 
tion of D 2 (t). Using equation (1) of Lemma 1: 


oo oo [j/2] u-y+1 [j/2] J-2M-1 r / \ / \ 

W = EEE E E E 2 "'C)(“:j) 

3—0 u—j i=0 v =» h—0 L— 1 *■ ' ' \*7 * / 

(_ 1)U +^-i(v u_v)-i Cfc ,_ 2/i+1 ^ 

(g* + (-i)* + 1 p w ) (g**-" + (_i)i-»+ip«-«) 


mlt) - 


pj—h+l—l— uqH— 1-f u^u —3 


-u+h 


2 fv\ fw + h- v 
3 ~ » 



oo [y/2] oo [y/2] 

= EE E E E E 

3—0 h —0 w— 3 —h i=0 v—i L— 1 

(-l)^W-l(y w + h-v) - 1 ^,, ( Q « + (-l)*+lp«) 

|gw+/i-t; _|_ -Qi-t+1 pw+/i-vj 

where w = u — h. 

When the summations involving j, h and w are interchanged, 

oo \o/ 2 \ oo oo vj h-\-w 

tl J2 becomes £ £ £ • 

j —0 h —0 w=j-h v>= 0 h= 0 j = 2 h 


Also, if w = 0 then h, j, t and v are all zero. In this case (g t ’ + (-l)’ +1 P v ) = 
0. Interchanging the summations and removing the zero terms involving 
$w = 0$, transforms (12) into the form of equation (7). This completes the
proof of the theorem.

The infinite series expansion of the bias and mean square error of $\hat\lambda(t)$
established in Theorem 1 can be implemented to calculate these quantities
to various orders of $n^{-1}$. The actual computations required using (6) and (7)
are simple but tedious. Corollaries 1 and 2 provide alternative formulations
for the bias and mean square error. Although these forms appear more
complex, they are more convenient to use since most terms in the expression
are constant with respect to $n$, $P$ and $t$ and need to be computed only
once and then used for any future calculations.

In Corollary 1, the computing formula for the bias is expressed in the
form given in (4).


Corollary 1. Let $X$ be a random variable distributed as a binomial with
parameters $n$ and $P$. An asymptotic series in powers of $n^{-1}$ of the bias of
the estimator $\hat\lambda(t)$ of $\ln(P/Q)$ is

$$b(t) = (Q - P) \sum_{w=1}^{\infty} \sum_{s=0}^{w-1} \sum_{g=2s-w+1}^{w} A_{w,s,g}\, t^{\,w-g}\, (PQ)^{s}\, (nPQ)^{-w} \tag{13}$$

where 


min [ t, 

9 8 \ L 

A...,, = E E E 

fc=0*=,_[£=£l u=a-[* 


+ h-l + K. g _. h 


]) 


SgSt8 u 


w + h 
9 + h 


(—l) W + h + 8+1 (w + h ) 1 Ch it g-h'+l,8-L+ldw + h t u+l t K g - 

dg — h—28 + 2t,t— u-f 1,1 — h, * 


g-h 


(14) 


Proof. The bias series is obtained by taking the expectation of D(t) in 
equation (8). This expectation is taken using equation (2) of Lemma 1. 
That is, 


oo oo[y/2][^±ll 


oo oo [3/*1 l 2 J r / .\ 

ho = EEE E (')(-i) i+< i-'c k ,,- !k+ i,,i i ->' 

j=oi~jh=o 1=1 L \J/ 

(Q* _J_ —2^+2 _|_ — 2h— 2£+2^ 

(PQ) h+t - i ~ 


—t+h 


oo [i/2] OO [ * 


OO OO l 2 j r / , V 

=EE E E { \ Vir^w+y-' 

j= 0 h= 0 w=j-h 1=1 1 V J ' 
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C hJ - 2k+1 , t t vl+k - :i {Q w + h + (-1)J+ 1 P“'+ A ) 

(gi-2^-2£+2 _|_ — 2£-f2^ ^ pQ^l—l — w 


n 


, , \izlh±l 

OO W h+w [ 2 jr / , \ 

EEE E ("+ k )(-ir +k+ ‘(»+*)- 1 

w = lh=0 j=2h 1=1 X J ' 

C KiJ - 2h+lil t w+h - 3 ' {Q w+h + (-l)' + 1 P" +fc ) 

(QJ —2^+2 —2/i—2£-j-2^ 1 — 


(15) 


Lemma 2 is now applied to

$$Q^{w+h} + (-1)^{j+1} P^{w+h} \quad \text{and} \quad Q^{\,j-2h-2\ell+2} + (-1)^{j} P^{\,j-2h-2\ell+2}.$$

That is,

$$Q^{w+h} + (-1)^{j+1} P^{w+h} = (Q - P)^{1-\kappa_j} \sum_{u=1}^{[(w+h+1+\kappa_j)/2]} d_{w+h,\,u,\,\kappa_j}\, (-1)^{u-1} (PQ)^{u-1} \tag{16}$$

and

$$Q^{\,j-2h-2\ell+2} + (-1)^{j} P^{\,j-2h-2\ell+2} = (Q - P)^{\kappa_j} \sum_{v=1}^{[(j-2h-2\ell+4-\kappa_j)/2]} d_{j-2h-2\ell+2,\,v,\,1-\kappa_j}\, (-1)^{v-1} (PQ)^{v-1}. \tag{17}$$

The upper limit of the summation in expression (17) is simply

$$\left[\frac{j - 2h - 2\ell + 4}{2}\right]$$

since

$$\left[\frac{j - 2h - 2\ell + 4 - \kappa_j}{2}\right] = \left[\frac{j - \kappa_j}{2}\right] + (-h - \ell + 2)$$

and if $j$ is an even integer or zero then $\kappa_j = 0$ and $[(j - \kappa_j)/2] = [j/2]$.
On the other hand, if $j$ is an odd integer then $\kappa_j = 1$ and $[(j - \kappa_j)/2] =
[(j - 1)/2] = [j/2]$.
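The structure of Lemma 2 can be explored numerically without the closed form of the constants $d$. The sketch below is an assumption-laden illustration: it relies on the elementary fact that, because $P + Q = 1$, both $S_\ell = Q^\ell + P^\ell$ and $U_\ell = (Q^\ell - P^\ell)/(Q - P)$ satisfy $S_\ell = S_{\ell-1} - PQ\,S_{\ell-2}$, and it builds the polynomial in $PQ$ from that recurrence. The name `pq_coefficients` is hypothetical, and the constants it returns correspond to $(-1)^{j-1} d_{\ell,j,\kappa}$ only under the sign convention of equation (3).

```python
def pq_coefficients(ell, kappa):
    """Coefficients c[k] such that
    Q^ell - (-1)^kappa P^ell = (Q-P)^(1-kappa) * sum_k c[k] (PQ)^k, for ell >= 1.
    """
    # Starting values: S_0 = 2, S_1 = 1 (kappa = 1); U_0 = 0, U_1 = 1 (kappa = 0).
    prev, cur = ([2], [1]) if kappa == 1 else ([0], [1])
    for _ in range(ell - 1):
        nxt = [0] * max(len(cur), len(prev) + 1)
        for k, c in enumerate(cur):
            nxt[k] += c
        for k, c in enumerate(prev):
            nxt[k + 1] -= c          # subtract (PQ) times the term two steps back
        prev, cur = cur, nxt
    while len(cur) > 1 and cur[-1] == 0:
        cur.pop()                    # drop trailing zero coefficients
    return cur

def check_identity(ell, kappa, p):
    """Return (left side, right side) of equation (3) at a given P."""
    q = 1.0 - p
    lhs = q**ell - (-1)**kappa * p**ell
    rhs = (q - p)**(1 - kappa) * sum(c * (p * q)**k
                                     for k, c in enumerate(pq_coefficients(ell, kappa)))
    return lhs, rhs

if __name__ == "__main__":
    for ell in range(1, 7):
        print(ell, pq_coefficients(ell, 0), pq_coefficients(ell, 1))
```

The number of coefficients produced matches the upper summation limit $[(\ell + 1 + \kappa)/2]$ used in (16) and (17), for example $Q^3 + P^3 = 1 - 3PQ$ and $Q^4 + P^4 = 1 - 4PQ + 2(PQ)^2$.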
Substituting (16) and (17) into expression (15) yields

[t0 + k+l + IC -1 

-»-“-J 

6(0 = (<? --p) £ EE E E E 


to = l h =0 j= 2 h 1=1 u=l 


w + h 


\to-f h+L+u+v 


( W+h ) 1 Ch,j-‘ 2 h+l,tt 


w + h—j 


d tt , + / l ,u, ( c y dy- 2 fc - 2 / + 2 , w ,i- ( c,}(P<?) i+u+, '- 3 (nPQ)- ,u . (18) 

Adjusting the order of the summations, equation (18) is transformed to 


oo to —1 / to tu-\-h a 3 a 4 / , v 

= w-p)EE{EEEE (”'< )(-* 

to = l 8=0 ^ h=Q j=ai t=l u=a% \ J / 

(w + h) 1 Chj-2h+i t it v) ' ¥h *d w +h tUtKj . 
dj-2h-2l+2,e-l-u+3,l-Kj j CPQ) a (nPQ)~ W , 

ai = 2h + max(0,2s - w - h + 1), 

a 2 = min(s, [j/2] - h) + 1, 

a 3 = max(0, s - [j/2] + h) + 1, 

. ( . ^ \w + h — l + /c-l\ 

a 4 = mm ( s - 1 + 1, ---- ) + 1. 


w + h 


where 


\10+/l + S+l 


Equation (19) is now transformed using two changes of variables. First, j is 
replaced by j + 2h. Then u and £ are changed to u +1 and £+1 respectively. 
These changes yield: 


oo iu — 1 ( to to — h 62 64 / , \ 

ho = («-p)EEEEEE 

w=l 8=0 K h= 0 j=bi t =0 u=t 3 XJ 7 

(w + h)~ 1 Ch,j+l,t+lt W ~ 3 ~ h d w+ h,u+l,K ) dj- 2 l,a-t-u+l,l-K j 

(PQ) a (nPQ)~ w , 


to+/i+s+l 


where 


fei = max( 0 , 2s + 1 — w — h), 

62 = min(s, [i/2]), 

63 = max( 0 , s — [j/ 2 ]), 

* . ( . \w + h- 1 + k 3 - 

64 = min is — £, ---- 
Setting $g = h + j$, and then replacing $\ell$ by $s - \ell$, (20) can be expressed
as


OO w — l ( w g 8 C A / I \ 

tU = l 9=0 g~cI h=0 t—C 2 u — c 3 


W -f* h-\- 8 -f-1 


(tx; + h) 1 Ct l) g-h+l l 8-l+ldw+h i u+l,Kg-Hdg-h-28+2l 1 l-u+l,l-K'g-h 

t v, - g \(PQy(nPQ)- ,u , (21) 


where 


ci = max( 0 , 2 s - w + 1 ), 

\g-h 


C 2 = max (o, s - 
c 3 — max (o, s - 
C 4 — min (£ 


2 

g-h 


) 


W + h — 1 + K g-h 


Finally, using the indicator variable S a , equation ( 21 ) becomes 


min ( t, f + 1 ) 

OO tO — 1 ( to g 8 \ L J/ 

H0-w-nEE{ E E E E 

to = l «=0 K g=28-w + l h=0 u =fi-[ z ^] 

W« (^g + h) (-l)" +fc+,+1 («' + JO -1 

Ch^g—k+ 1,9—£-f 1 +K,u+ 1 ,K g -h. 

d^-A.-29+2^i-u+i,i-/c ff -fc^ w 3 1 ( PQ) 8 {nPQ ) w 

=(«-f)EE E ^/‘WW 

tU = 1 9=0 <7=29 —to-fl 


and the proof of the corollary is complete. 

Using a computer program, the constants $A_{w,s,g}$ have been calculated
for various values of $w$. The constants, or coefficients, calculated have been
tabulated in Table 1 for orders up to $w = 7$. As noted by Haldane, the
setting of the constant $t$ to 1/2 leads to an unbiased estimator to order $n^{-1}$.
That is,
Table 1. $A_{w,s,g}$ Coefficients of the Asymptotic Series $b(t)$
in Powers of $n^{-1}$ of the Bias.

 w  s       g=0       g=1       g=2       g=3       g=4       g=5       g=6       g=7
 1  0    1.0000   -.50000
 2  0   -.50000    1.0000   -.41667
 2  1             -1.0000    .50000
 3  0    .33333   -1.5000    2.0000   -.75000
 3  1   -.33333    3.0000   -5.0000    2.0000
 3  2                        1.0000   -.50000
 4  0   -.25000    2.0000   -5.5000    6.0000   -2.0917
 4  1    .50000   -6.0000    20.000   -24.000    8.6833
 4  2              2.0000   -12.500    19.000   -7.4167
 4  3                                 -1.0000    .50000
 5  0    .20000   -2.5000    11.667   -25.000    24.000   -7.9167
 5  1   -.60000    10.000   -55.000    130.00   -132.00    44.667
 5  2    .20000   -7.5000    60.000   -172.50    194.00   -68.500
 5  3                       -8.3333    45.000   -65.000    25.000
 5  4                                            1.0000   -.50000
 6  0   -.16667    3.0000   -21.250    75.000   -137.00    120.00   -37.871
 6  1    .66667   -15.000    122.50   -475.00    923.00   -840.00    270.23
 6  2   -.50000    18.000   -192.50    880.00   -1893.5    1830.0   -605.82
 6  3             -3.0000    70.000   -465.00    1225.0   -1320.0    458.17
 6  4                                  30.000   -150.50    211.00   -80.417
 6  5                                                     -1.0000    .50000
 7  0    .14286   -3.5000    35.000   -183.75    541.33   -882.00    720.00   -219.04
 7  1   -.71429    21.000   -238.00    1365.0   -4281.7    7287.0   -6120.0    1890.5
 7  2    .85714   -35.000    490.00   -3237.5    11181.   -20300.    17760.   -5603.8
 7  3   -.14286    14.000   -308.00    2660.0   -10866.    21938.   -20460.    6662.0
 7  4                        28.000   -525.00    3133.7   -7801.5    8162.0   -2798.3
 7  5                                           -100.33    483.00   -665.00    252.00
 7  6                                                                1.0000   -.50000

$$b\left(\tfrac12\right) = \frac{P - Q}{24\, n^2 P^2 Q^2} + O(n^{-3}) = \frac{1}{24}\left[\frac{1}{(nQ)^2} - \frac{1}{(nP)^2}\right] + O(n^{-3}). \tag{22}$$

It is because of this fact that the choice for $t$ suggested is 1/2.
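Equation (22) can be compared with the exact bias obtained by enumeration. The sketch below is illustrative only; the function names and the values of $n$ and $P$ are assumptions. It prints the exact bias of the estimator with $t = 1/2$ next to the leading term $(P - Q)/(24 n^2 P^2 Q^2)$ for increasing $n$.

```python
from math import comb, log

def exact_bias_half(n, p):
    """Exact bias of ln((X + 1/2)/(n - X + 1/2)) for X ~ Binomial(n, p)."""
    q = 1.0 - p
    expected = sum(comb(n, x) * p**x * q**(n - x)
                   * log((x + 0.5) / (n - x + 0.5))
                   for x in range(n + 1))
    return expected - log(p / q)

def leading_term(n, p):
    """Leading term of the bias at t = 1/2: (P - Q) / (24 n^2 P^2 Q^2)."""
    q = 1.0 - p
    return (p - q) / (24.0 * n**2 * p**2 * q**2)

if __name__ == "__main__":
    for n in (25, 50, 100, 200):
        print(n, exact_bias_half(n, 0.4), leading_term(n, 0.4))
```

Doubling $n$ should reduce both columns by roughly a factor of four, which is the $n^{-2}$ behaviour that (22) describes.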
Due to this property of $t = 1/2$, bias coefficients $A_{w,s}(1/2)$ have been
calculated for

$$b\left(\tfrac12\right) = (Q - P) \sum_{w=1}^{\infty} \sum_{s=0}^{w-1} A_{w,s}\left(\tfrac12\right) (PQ)^{s} (nPQ)^{-w}, \tag{23}$$

where

$$A_{w,s}\left(\tfrac12\right) = \sum_{g=2s-w+1}^{w} A_{w,s,g} \left(\tfrac12\right)^{w-g}.$$

These coefficients for values of $w$ up to 7 are given in Table 2.


Table 2. $A_{w,s}(1/2)$ Coefficients of the Asymptotic Series
$b(1/2)$ in Powers of $n^{-1}$ of the Bias

 w        s=0        s=1        s=2        s=3        s=4        s=5
 1          0
 2   -.041667          0
 3   -.083333     .20833          0
 4    -.23229     .96458    -.79167          0
 5    -.85833     4.8979    -7.5875     2.7083          0
 6    -3.9830     28.807    -65.673     50.573    -8.7917          0
 7    -22.319     195.24    -588.65     715.05    -307.85     27.708
The formula $b(t)$ in (4) derived for the bias is based on asymptotic series
for the logarithmic functions $\ln[1 + (t + \gamma)/(nP)]$ and $\ln[1 + (t - \gamma)/(nQ)]$. As a
result, the infinite series expression $b(t)$ does not converge; however, the series
is asymptotic and “in such series the terms begin to decrease, and reach a
minimum, afterwards increasing” (Bromwich, 1965). The larger the $n$ and
the less extreme the $P$ considered, the more of the range of possible values
lies in the interval of convergence of the series expansions of the logarithmic
functions. Thus it is anticipated that for such values of $n$ and $P$, the first few
terms of the asymptotic series formulation $b(t)$ will be a good approximation
to the bias.
Similar to the bias, a computing formula for the mean square error given
in equation (5) can be derived.

Corollary 2. Let $X$ be a random variable distributed as a binomial with
parameters $n$ and $P$. An asymptotic series in powers of $n^{-1}$ of the mean
square error of the estimator $\hat\lambda(t)$ of $\ln(P/Q)$ is

$$m(t) = \sum_{w=1}^{\infty} \sum_{s=0}^{w-1} \sum_{g=2s-w}^{w} F_{w,s,g}\, t^{\,w-g}\, (PQ)^{s}\, (nPQ)^{-w} \tag{24}$$

where 


g [ 2 




0+* 


\-Kg-h~" 


'-« = EE EE E 

h=0 t=0 v=x r _ a _j £ -Hj_ ^ «>+h-«+i + ic 

min(£-r,[ —- ” ) 

L J min(£—r—u,l —/c t - +/c t - Kg-K} 

Y Y 6 g 6 t 6 r 8 u 8f 

u—8— [*- 3 ^] — r— l + *c,- —KiKg-h /=«—[ £j y^] —r — u 

2 *’ g + h Sv—iSw+h—v—ifi r fcl r^+fc+*e__ fc l 

[*"5”] + 2 -1 J — *v> + h,+ K g _ h ) 

4 , C) (T 1 r) 

—1 —(-!,«• ^to+h—v,u-f 


*0-fc-2(a-£),£-r-u-/+l,l“#C5-fc • 


(25) 


Proof. The series for the mean square error can be obtained by taking the 
expected value of (11). Using equation (2) of Lemma 1, 


\u+£+1 


oo oo [y/2] u-j+t [i/2] [ 3 2 £+ 2 ] n / \ / \ 

-m-eee e e e 

j=0u=j %=0 v=i h=0 1=1 L \ / / 

(<5” + (-1) <+1 F«) 

^y-2^-2^H-2 _j_ pj-2K-2L+2^ 


(pg)fc+4-u-l 
oo [y/2] OO 


, — u+h. 


U7 2 ] w+fc-y+* [ J 2 2 i+2 ] r 


OO OO 3 J / \ / , I \ 

= EE E E E E **<(; "y-r 

y=0 k=0 w=j-h i =0 «;=♦ £=1 \ / \ J / 
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(_!)«.+*+<+ifc w + h - v) - 1 c h .,_ 2 ^ 1 / r +fc -^ 

(i Q v -(- (-1)* +1 P ,; ) (gw + fc-v (_i)J-«+lpiu + /i-ti^ 

(g}-2K-2l+2 + pj-2h.-2l+2} (pg)«- 1-™ n ~«J 


(26) 


where u is replaced by tu + h in the last step. 

When the order of summation involving j , h , and it; are interchanged, 

oo [j/2] oo co tu h+w 

EE E EEE 

3=0h—0 w=j-h w=Qh=0j—2h 

and (26) can be expressed as 


oo w h+w [j/2] vj + h—j-\-i [* 


m 


2 6 h’ 


v\ fw + h — v 



J ~ i 


« = EEEE E E 

tw=0 h=0 j=2h i=0 v=i 1=1 

(_!,„+»+<+, w + h-vj-iCj, 

(g w + (- 1 )’ +1 P U ) (g»+fc-u _j_ 


(Q3-2K-2t+2 + (_ 1 ypj-2h-2t+2} (pg)i-1- 


n 


(27) 


The lower limit of summation for w can be set to 1 since for u; = 0 the 
corresponding term in (27) is zero. 

Using Lemma 2, for positive integers v and w + h — v, 

Q v + (-1 ) i+1 P v = (Q- P) 1 -* J2 ) r_1 ( W\ (28) 


r= 1 


and 


QVJ + h — V _|_ ^^ pVJ+h — V 

= (Q - P) 1 -*’-* J2 d w+h _ v>UtK ._ i (-l)«- 1 (PQ)»-\(29) 


It— 1 


The indicator functions l and can be incorporated, as multi¬ 

plicative terms, into (28) and (29) in order to extend these expressions to 
include the cases t; = 0 and it; + h — v = 0. Also, using Lemma 2.2, 

Qj-2h-2t+2 + ( —1)J pj-2h-2t+2 


f y-2fc-2H-4 j 


= (Q-P) K > £ d,. 2h . 2l+2tg>1 _ K .(-iy- 1 (PQ) 9 - 1 , (30) 

9 =1 
where the upper limit of summation in (30) follows for reasons as discussed
for equation (17). Substituting (28) to (30) into (27) yields

oo w h+w [i/2] w+h-y-H [* 2 2 ^ 2 ] [ 5 L ] [ 2 

”(') = EEEE E E E E 

w = l h=0 j=2h i=0 v=i 1=1 r=1 u=l 

U-2h-2t+4 l 

E { 2 ‘° (j) ( w ) * 7 v ) (-i)«+*+^-+«+* 

^-i^+fc- u -i(Y W + h- \) ~ 1 C h . i .t k+1 .it w+k ~ i d„ i r lK1 

dtu+/t-u,u,(Cy_ i dj-2fc-24+2,j 1 l —Kj ^ 

^pQj L - w+ r+u+g-4 n -w^Q _ pj2(l-«,+K <( cy) ^ (31) 

where the power of the ( Q — P) term follows from 

1 H - + 1 ^j—% — 2 “H (/Cy /Cjj — 2(1 /c* “f - /Cj/cyj. 

Since 

(Q — p) 2 (i-*t+/c»#c/) = (1 ~ 4P(2 ) 1 ” k *+ k » #c j 

1-Ki+KiKj . 

= E (‘-»<+*<«-■)(-,)/ 4 /(«,)/ 

f=0 \ J / 

1 — Ki+KiKj 

= E (-1 ) f 4 f {PQ) f , 

f=0 

equation (31) can be expressed as 


oo tu /i-ftu [j/2] tu-f fc—i+1 [* 2 + ] 


1=1 


+ 1 + K • 


w-EEEE E E E E 

w = l h=0 j=2h i=0 v=i 

l-Ki+KiKj [ f ~- 2 - fc 2 2 ^ 4 ] 


r=1 


U=1 


E” E { 4 / 2 ^Q(- + ft p)(_ ir+ ^ +u+J+/ 

/=0 < 7=1 ^ \ J 1 / 

$v — lfiw+h—v — 1 (Y W + h-V) 1 C f fc | y_2^+ 3 ^v t r t Ki 

^ +fc -v,u,«y_ j dy-2 fc -2i + 2 lffl i-^}(^) < “ v,+r+u+s+/ “ 4 n~ ,u . (32) 
Adjusting the order of summations, equation (32) is transformed to

oo to —1 f to w+h [i/2] w+h—j+i d 2 d 4 d 6 d* 

”>«) = EE EEE E EE EE* 

w=l «=0 ^ h=0 j=di i =0 v=t 1=1 r=d 3 u=d& f=d 7 

S 4f( v \( w + h ~ ^ 

-#-K._ i+1 (l-(c„ +k+Kj .) \*/ \ j — t ) 

(_ 1 )^+'>+«g u _ 1 g w+h _ u _ 1 (v w + h - v) ~ 1 C h .,_ 2 h+ 1 .tt w+h - :i 

dv,r,Kidw+h—v,u,K } --idj—2h—21+2,a—l— r— u—/+4,1—«y ) 

(PQ)*(nPQ)—, (33) 

where 


di = 2h + max(0,2s - w - h), 
d 2 = min(s, [j/2] - h) + 1, 

c ?3 = max ^0, s - [j/2] + h 

» . i „ ^ f v — 1 + /C; 

c ?4 = mm ( 8 — £ + 1, 


w -\- h — V+1 + /cy — /c t * 


) + ', 


+ I? 


d 5 = max(0, s - [j/2] + h - r + Ki - K{Kj) + 1, 

W + h — V — 1 + Kj-i 


d 6 = min — £ — r + 2, 

d 7 — max(0, s - [j/2] + h- r- u + 2), 
d 8 = min(s — £ - r - u + 3,1 - /c* + Kt-iey), 


+ 1> 


and where the upper summation limit of s has been lowered to w — 1 since 
for s = w the indicator function 


S 

[i/2 ]—M- 


*+*+«.,• 1 
2 J 


*V-*+l (l“*W + /l+ICy 


- = 5-i = o. 


Equation (33) is now transformed as follows: first, j is replaced by j+2h\ 
second, r, u and £ are replaced by r + 1, u + 1 and £+1 respectively; third, 
g is set to be h + j; and fourth, £ is replaced by s - L These changes yield: 


OQ to —1 to ( g [ ] tv-g+i a c 3 e & e 7 

™<o=EE E EEEEEEE*‘" 

w = l 8=0 g=max(0,28—iv) ^ h=0 i=0 v=i l=t\ r=c 2 ti=c 4 f=t e 
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6 v -iS w ^.fi- 


-v-lS. 


J+|-5“* — a— <c»—»>«+fc+ic g _ ( 


*' (•) (" +£: ”) w+b^v)-‘c», 

^«,r + l,#Ci Ai—v,u+l,« ff -fc-» ^g — h— 2(a—£),£— r — u — /-f 1,1 — tc g -h ^ 

t w -'{PQy(nPQ )- w 9 (34) 


where 


ex = max I 0, s — 




e 2 = max 


(°, 


2 

9 ~h 


U/ + /l-t>+l + Kg-h ~ 


“O' 


• f» rv-i+Kii^ 

e 3 = mm It, --- I , 

( \9~h 

e± = max ( 0,s — 
e 5 = min - r, 

(o,s- 


- r - 1 + Ki - KiKg-h , 


e Q = max 


tv + h — v — 
g-h 


v - 1 + Kg-h-i 1 \ 

2 \)' 

-«), 


€7 = min (£ — r - u, 1 — /c* + . 

Using the indicator functions, more specifically $\delta_g$, $\delta_\ell$, $\delta_r$, $\delta_u$ and $\delta_f$,
equation (34) is transformed into equation (24) and the proof is complete.

The constants $F_{w,s,g}$ have been calculated for various values of $w$ using
a computer program. The calculated constants are presented in Table 3 for
values of $w$ up to and including 7.

Disregarding terms of order higher than $n^{-1}$, the familiar expression for
the mean square error,

$$m(t) = \frac{1}{nPQ} = \frac{1}{nP} + \frac{1}{nQ},$$

derived, for example by Woolf (1955), follows from using the coefficients in
Table 3.

For $t = 1/2$, the mean square error coefficients $F_{w,s}(1/2)$ have been
calculated for

$$m\left(\tfrac12\right) = \sum_{w=1}^{\infty} \sum_{s=0}^{w-1} F_{w,s}\left(\tfrac12\right) (PQ)^{s} (nPQ)^{-w}, \tag{35}$$
Table 3. $F_{w,s,g}$ Coefficients of the Asymptotic Series $m(t)$
in Powers of $n^{-1}$ of the Mean Square Error

 w  s       g=0       g=1       g=2       g=3       g=4       g=5       g=6       g=7
 1  0              1.0000
 2  0    1.0000   -3.0000    1.7500
 2  1   -4.0000    8.0000   -5.0000
 3  0   -1.0000    5.5000   -8.8333    4.0000
 3  1    4.0000   -22.000    37.333   -17.667
 3  2              12.000   -24.000    13.000
 4  0    .91667   -8.3333    25.917   -32.000    12.757
 4  1   -4.3333    42.667   -142.17    184.50   -75.778
 4  2    2.6667   -45.333    181.00   -260.33    113.42
 4  3                       -28.000    56.000   -29.000
 5  0   -.83333    11.417   -58.278    136.67   -144.18    52.861
 5  1    4.6667   -70.500    387.72   -960.00    1051.1   -393.97
 5  2   -5.3333    111.00   -716.78    1959.0   -2284.1    887.73
 5  3             -26.667    313.33   -1090.0    1446.7   -603.33
 5  4                                  60.000   -120.00    61.000
 6  0    .76111   -14.700    111.63   -422.50    829.43   -784.75    270.46
 6  1   -4.9667    105.90   -866.83    3470.8   -7099.6    6911.1   -2423.6
 6  2    8.2000   -220.60    2077.8   -9138.0    19935.   -20259.    7289.3
 6  3   -2.0444    116.80   -1559.3    8294.0   -20318.    22204.   -8329.3
 6  4                        173.33   -1746.7    5641.0   -7149.7    2905.1
 6  5                                           -124.00    248.00   -125.00
 7  0   -.70000    18.150   -192.00    1066.6   -3329.3    5762.9   -5024.2    1649.3
 7  1    5.2333   -149.20    1698.2   -9981.7    32526.   -58135.    51845.   -17263.
 7  2   -11.267    386.33   -4976.7    31944.  -110960.   207660.  -191180.    64919.
 7  3    6.1333   -323.47    5347.6   -40171.   155010.  -311720.   301080.  -105220.
 7  4              42.933   -1519.5    16506.   -78474.   179420.  -187820.    68762.
 7  5                                 -933.33    8670.7   -26782.    32984.   -13181.
 7  6                                                      252.00   -504.00    253.00

where 

w 

g=2s—w 

These values are presented in Table 4 for w from 1 to 7. When w = 1 , of 
course, there is no change in m(t) since it is independent of this parameter. 

Similar to the discussion on the bias, the formula for mean square error 
given by (5) is an asymptotic series and it is anticipated that for large n and 



BIAS AND MSE OF LOGIT ESTIMATOR 




Table 4. $F_{w,s}(1/2)$ Coefficients of the Asymptotic Series
$m(1/2)$ in Powers of $n^{-1}$ of the Mean Square Error

                                   s
 w        0          1          2          3          4          5

 1     1.0000
 2     .50000    -2.0000
 3     .83333    -4.0000     4.0000
 4     2.2517    -14.007     23.000    -8.0000
 5     8.3389    -64.212     152.59    -115.00     16.000
 6     39.162    -360.05     1124.2    -1364.1     533.00    -32.000
 7     222.73    -2381.5     9220.0    -15585      10884     -2359.0


P values in the midrange that the first few terms of the formulation will be 
a good approximation to the mean square error. 

Corollary 2 provides an asymptotic series approximation of the mean
square error of A(t). Using this result and b(t) of Corollary 1, a similar
asymptotic series in powers of $n^{-1}$ for the variance of this estimator can be
derived.


3. DISCUSSION 

Asymptotic series in powers of $n^{-1}$ were developed for the bias
and mean square error in logit estimation. Such series are characterized by
terms that begin to decrease, reach a minimum, and increase thereafter. For
large n and moderate P, the first several terms of each series will provide an
adequate approximation to the quantity it represents. Even for small values of
n, the first few terms of these series often perform satisfactorily. The one
notable exception is the asymptotic series for the bias
of the logit estimator at small values of P. The optimal number of terms to
retain depends on the parameters n and P.

The series for the bias illustrates the reasoning behind the customary 1/2
cell adjustment. However, the boundary problem of the estimator and the
mean square error criterion raise issues regarding the “optimality” of
1/2. The asymptotic series b(t) and m(t) for the bias and mean square error
of the logit estimator, respectively, have been developed as functions of t.
They will provide an analytical basis for a more thorough investigation of
cell adjustment and its ramifications for the estimation of the log odds ratio
and common odds ratio.

The asymptotic series have been mathematically expressed in such a way
that general coefficients can be developed. This makes the application of the
formulas to the log odds ratio a simple step. Based on these formulations, the
bias and mean square error of the log odds ratio estimator can be
investigated thoroughly. In particular, as noted above, the effect of the
cell adjustments can be evaluated analytically.


ESTIMATION OF A COMMON ODDS RATIO
ABSTRACT 

For comparing two proportions from stratified samples, one approach is
via inference on the odds ratio. The various point and interval estimators of
a common odds ratio from multiple 2 x 2 tables are reviewed in this paper,
with particular emphasis on the point estimators, their asymptotic properties,
and what is known about finite-sample properties. Based on research
to date, the conditional maximum likelihood and Mantel-Haenszel estimators
are recommended as the point estimators of choice. Confidence interval
methods have not been studied as well, but there is a method associated
with the Mantel-Haenszel estimator that is a good choice. Modifications of
these estimators may improve their finite-sample properties. However,
further work is needed before one of these modifications can be
recommended for general use.

1. INTRODUCTION 

One of the common problems of statistical inference is that of comparing
two proportions adjusting for stratification. This problem arises in
nonrandomized studies where the data may be stratified according to values of a
confounding variable in order to eliminate the bias due to confounding
(Anderson et al., 1980, Chapter 7). In randomized studies, stratification may
be employed to increase the precision of inferences, particularly if stratified
randomization had been used (Green and Byar, 1978).

One approach to the comparison of two proportions is by inference
regarding the corresponding odds ratio. In this paper, the various point and
interval estimators of the odds ratio are reviewed. Asymptotic properties
under both asymptotic conditions, fixed number of strata and increasing
number of strata, are presented as each estimator is introduced in Section
4. Finite-sample comparisons of the estimators are presented in Section 5.

2. NOTATION AND ASSUMPTIONS 

After stratification, the data fall into K 2 x 2 strata as follows for stratum
k:

                               Group
                       1                 2

Outcome   Success    X_1k              X_2k              T_k
          Fail       N_1k - X_1k       N_2k - X_2k       N_k - T_k

                     N_1k              N_2k              N_k


Each $X_{ik}$ is assumed to be binomial with parameters $P_{ik}$ and $N_{ik}$, $i = 1,2$
and $k = 1,\ldots,K$, and to be independent of all the other binomial variables.
Implicit in the binomial assumption is an assumption of within-stratum
homogeneity. In particular, within each stratum, the same probability of
success is assumed to apply to all units belonging to each group. Consequently,
results based on this assumption will not apply in the case of stratification
of a numerical variable such as age, though one can argue that they ought
to be a reasonable approximation if the number of strata is large enough to
ensure approximate homogeneity.

The odds ratio in the $k$th stratum is

\[ \psi_k = P_{1k} Q_{2k} / (Q_{1k} P_{2k}), \]

where $Q_{ik} = 1 - P_{ik}$, $i = 1,2$ and $k = 1,\ldots,K$, and $0 < \psi_k < \infty$ is assumed
throughout. In this review, we consider only the case where the odds ratio
is the same in all strata:

\[ \psi_k = \psi \quad \text{for all } k = 1,\ldots,K. \]

The natural logarithm of the odds ratio will be denoted by $\gamma_k = \ln \psi_k$ and
$\gamma = \ln \psi$, and the total sample size will be denoted by $N = \sum_{k=1}^{K} N_k$.


3. ASYMPTOTIC CASES

Consideration of the asymptotic properties of estimators of $\psi$ is
complicated by the existence of two appropriate asymptotic cases, referred to here
as the fixed-strata and increasing-strata cases. In the fixed-strata case, the
number of strata is fixed, and the sample sizes within each stratum become
large. In the increasing-strata case, the converse holds; the stratum sizes are
fixed, but the number of strata becomes large. More formally,

Fixed-strata asymptotic conditions:

a) $K$ is fixed; and

b) $N_{ik} = N A_{ik}$, where $A_{ik} > 0$ for all $i$ and $k$,
and $N \to \infty$.

Increasing-strata asymptotic conditions:

a) The $N_{ik}$, $i = 1,2$ and $k = 1,\ldots,K$, are fixed; and

b) $K \to \infty$.

For the fixed-strata case, inference is based on the binomial distributions
introduced in Section 2. When working with the increasing-strata case,
the principal distribution of interest will be that of $X_{1k}$ given $T_k$, $N_{1k}$, and
$N_{2k}$. As originally noted by Fisher (1935), this distribution is the extended
hypergeometric (Harkness, 1965):

\[ \Pr[X_{1k} = x \mid T_k, N_{1k}, N_{2k}] = \frac{\binom{N_{1k}}{x}\binom{N_{2k}}{T_k - x}\psi^{x}}{\sum_{u=a_k}^{b_k}\binom{N_{1k}}{u}\binom{N_{2k}}{T_k - u}\psi^{u}}, \tag{3.1} \]

where $a_k = \max(0, T_k - N_{2k})$ and $b_k = \min(T_k, N_{1k})$. When inference is
based on the distribution (3.1), asymptotic results for the increasing-strata
case will depend on the marginal joint distribution of $(T_k, N_{1k}, N_{2k})$.
Following Breslow (1981), it is convenient to index possible values of this triplet
(possible marginal configurations) by $j = 1,\ldots,J$, where $J$ is assumed to
be finite, and let $\pi_j = \lim_{K \to \infty} K_j/K$, where $K_j$ is the number of tables with
configuration $j$.
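
As an illustration of (3.1), the following Python sketch evaluates the extended hypergeometric probabilities for a single table by direct enumeration. The function name `ext_hypergeom_pmf` and its argument names are hypothetical choices made for this example, and direct enumeration is assumed to be adequate only for small tables.

```python
from math import comb


def ext_hypergeom_pmf(x, t, n1, n2, psi):
    """Pr[X1 = x | T = t, N1 = n1, N2 = n2] for odds ratio psi, following (3.1)."""
    a = max(0, t - n2)  # smallest attainable value of X1
    b = min(t, n1)      # largest attainable value of X1
    if x < a or x > b:
        return 0.0
    # Unnormalized weights C(n1, u) * C(n2, t - u) * psi**u for u = a, ..., b
    weights = [comb(n1, u) * comb(n2, t - u) * psi ** u for u in range(a, b + 1)]
    return weights[x - a] / sum(weights)


# With psi = 1 the distribution reduces to the ordinary hypergeometric.
print(ext_hypergeom_pmf(1, 2, 3, 2, 1.0))
# For a matched pair (n1 = n2 = 1, t = 1), Pr[X1 = 1] equals psi / (1 + psi).
print(ext_hypergeom_pmf(1, 1, 1, 1, 3.0))
```

The matched-pair case shows why only discordant pairs carry information about the odds ratio: when the margin total is 0 or 2, the support collapses to a single point.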


4. THE ESTIMATORS 

We will consider five odds ratio estimators (Table 1), two estimators 
appropriate in the fixed-strata case and three that are appropriate in both 
asymptotic cases. Lastly, some modifications to the primary estimators are 





WALTER W. HAUCK 


Table 1. Point Estimators of the Odds Ratio from
Multiple 2 x 2 Tables

A. Appropriate in fixed-strata case only

   Unconditional maximum likelihood     Gart (1962)
   Woolf’s                              Woolf (1955), Gart (1966)

B. Appropriate in both asymptotic cases

   Conditional maximum likelihood       Birch (1964)
   Empirical Bayes                      Pierce and Sands (1975)
   Mantel-Haenszel                      Mantel and Haenszel (1959)


introduced. Confidence interval methods will be introduced along with their 
corresponding point estimator. 

4.1 Maximum Likelihood (Unconditional) 

The first estimator is maximum likelihood based directly on the binomial
distributions introduced in Section 2. The corresponding likelihood,
dropping the combinatorial terms, is

\[ L = \prod_{i=1}^{2} \prod_{k=1}^{K} P_{ik}^{X_{ik}} Q_{ik}^{N_{ik} - X_{ik}}. \tag{4.1} \]

Since

\[ P_{2k} = P_{1k} / (\psi Q_{1k} + P_{1k}), \quad k = 1,\ldots,K, \]

the likelihood is actually a function of only $K + 1$ parameters $P_{11},\ldots,P_{1K},\psi$.
The maximum likelihood estimate of $\psi$, $\hat{\psi}_{ML}$, is then found by maximizing
(4.1) with respect to the $K + 1$ parameters simultaneously. This is the odds
ratio estimate obtained by fitting the no-three-factor-interaction model in
log-linear or logit analyses, and any of the standard methods for obtaining
estimates for those models could be used here. This may require working
with $\gamma$ and

\[ \rho_k = \ln(P_{1k}/Q_{1k}), \quad k = 1,\ldots,K, \]

to avoid boundary problems if an iterative maximization process is used. If
$\psi$ (or $\gamma$) is the only parameter of interest, the maximization problem can be
reduced to that for the single parameter; see Section 4.3.







In the fixed-strata asymptotic case, the usual optimality results for
maximum likelihood estimation hold, and $\hat{\psi}_{ML}$ has the following asymptotic
distribution (Gart, 1962):

\[ N^{1/2}(\hat{\psi}_{ML} - \psi) \xrightarrow[N \to \infty]{} \mathcal{N}(0, \psi^{2}/V), \tag{4.2} \]

where $V = \sum_{k=1}^{K} v_k$ and

\[ v_k^{-1} = (A_{1k} P_{1k} Q_{1k})^{-1} + (A_{2k} P_{2k} Q_{2k})^{-1}, \quad k = 1,\ldots,K. \tag{4.3} \]

In the increasing-strata asymptotic case, however, $\hat{\psi}_{ML}$ need not even
be consistent. In the case of matched pairs, for example (the extreme case
of $N_{1k} = N_{2k} = 1$ for all $k$), $\hat{\psi}_{ML}$ converges to $\psi^{2}$ as the number of pairs
($K$) becomes large (Andersen, 1973). Pike et al. (1980) demonstrated the
considerable asymptotic bias, as $K \to \infty$, of $\hat{\psi}_{ML}$ when the sample sizes
per table are small, and Breslow (1981) has demonstrated that the $N_k$ must
be large for the asymptotic bias to be small. The problem is that the usual
maximum likelihood properties require a parameter space that remains fixed
as $N \to \infty$. This is violated in the increasing-strata case, since the number
of parameters increases with $K$.

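Constructing confidence intervals based on $\hat{\psi}_{ML}$ or $\hat{\gamma}_{ML}$ is
straightforward. Intervals for $\gamma$, for example, are of the form
$\hat{\gamma}_{ML} \pm z_{\alpha/2} \hat{V}^{-1/2}$, where
$z_{\alpha/2}$ is the upper $\alpha/2$ percentage point of the standard normal distribution
and $\hat{V}$ is an estimator of $NV$ in (4.2). That is, we have $\hat{V} = \sum_{k=1}^{K} \hat{v}_k$ with

\[ \hat{v}_k^{-1} = (N_{1k} \hat{P}_{1k} \hat{Q}_{1k})^{-1} + (N_{2k} \hat{P}_{2k} \hat{Q}_{2k})^{-1}, \tag{4.4} \]

where the $\hat{P}_{1k}$ and $\hat{P}_{2k}$ are based on the maximum likelihood estimates for
all $K + 1$ parameters:

\[ \hat{P}_{1k} = [1 + \exp(-\hat{\rho}_k)]^{-1} \]

and

\[ \hat{P}_{2k} = \hat{P}_{1k} / (\hat{\psi}_{ML} \hat{Q}_{1k} + \hat{P}_{1k}). \]

Clearly this approach to confidence intervals will be valid only in the
fixed-strata case.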

Intervals for $\psi$ can be constructed by exponentiating the $\gamma$ intervals or
directly using the large-sample variance of $\hat{\psi}_{ML}$. The exponentiation route
is preferable here (and for the other interval methods of similar form) as it
avoids the problem of negative values in an interval for $\psi$.

4.2 Woolf’s Estimator

Woolf (1955) proposed a noniterative estimator of $\gamma$:

\[ \hat{\gamma}_W = \sum_{k=1}^{K} w_k \hat{\gamma}_k \Big/ \sum_{k=1}^{K} w_k, \]

where

\[ \hat{\gamma}_k = \ln\{p_{1k} q_{2k} / (q_{1k} p_{2k})\}
   = \ln\{X_{1k}(N_{2k} - X_{2k}) / [(N_{1k} - X_{1k}) X_{2k}]\} \]

is the estimator of $\gamma$ from the $k$th stratum,

\[ p_{ik} = X_{ik}/N_{ik}, \qquad q_{ik} = 1 - p_{ik}, \]

and

\[ w_k^{-1} = (N_{1k} p_{1k} q_{1k})^{-1} + (N_{2k} p_{2k} q_{2k})^{-1}
   = 1/X_{1k} + 1/(N_{1k} - X_{1k}) + 1/X_{2k} + 1/(N_{2k} - X_{2k}). \tag{4.5} \]

Since

\[ N^{1/2}(\hat{\gamma}_k - \gamma) \xrightarrow[N \to \infty]{} \mathcal{N}(0, 1/v_k), \]

$w_k^{-1}$ can be seen as an estimate of the large-sample variance, $(N v_k)^{-1}$, of $\hat{\gamma}_k$,
and thus $\hat{\gamma}_W$ is based on the principle of weighting inversely proportional to
the variance. (Note that the variance estimators (4.4) and (4.5) differ only
in how the $P_{ik}$ are estimated.) The Woolf estimator can also be viewed
as a special case of the minimum logit chi-square estimator, proposed for
quantal-response bioassay by Berkson (1944).

Modifications to Woolf’s estimator have been considered since $\hat{\gamma}_W$ cannot
be calculated if there is a zero in any cell of any of the $K$ tables. Modifications
to $\hat{\gamma}_k$ and $w_k$ will be considered separately.

Haldane (1955) and Anscombe (1956) proposed a modification to the
estimator of the log odds that corresponds to

\[ \hat{\gamma}'_k = \ln\{(X_{1k} + .5)(N_{2k} - X_{2k} + .5) / [(N_{1k} - X_{1k} + .5)(X_{2k} + .5)]\} \]

as the estimator of the logarithm of the odds ratio. Haldane showed that
the choice of 0.5 eliminates the $N^{-1}$ terms in the bias and that therefore

\[ E(\hat{\gamma}'_k) = \gamma_k + O(N^{-2}). \]

The modified weights, $w'_k$, are defined analogously to the $w_k$:

\[ (w'_k)^{-1} = 1/(X_{1k} + .5) + 1/(N_{1k} - X_{1k} + .5)
   + 1/(X_{2k} + .5) + 1/(N_{2k} - X_{2k} + .5), \tag{4.6} \]

as proposed by Gart (1966). This “modified Woolf” estimator, $\hat{\gamma}_{MW}$, is then
calculated exactly as the original Woolf estimator after the addition of 0.5
to every cell. This estimator is also known as the empirical logit estimator.

Gart (1962) showed that, in the fixed-strata case, $\hat{\psi}_W = \exp(\hat{\gamma}_W)$ has
the same asymptotic distribution (4.2) as $\hat{\psi}_{ML}$ and, in particular, that $\hat{\psi}_W$
is asymptotically efficient. Since the 0.5 corrections for the modified Woolf
estimator are asymptotically negligible in the fixed-strata case, this
asymptotic optimality property will be shared by $\hat{\psi}_{MW}$.

The nature of the Woolf estimator is that it requires sufficiently large
sample sizes in each table in order for each $\hat{\gamma}_k$ to be at least reasonable in
terms of bias. In particular, Breslow (1981) has shown that both $N_1$ and
$N_2$ must be large in order to reduce the asymptotic bias. Consequently, the
Woolf estimators should not be expected to perform well in the
increasing-strata case.

Confidence intervals based on Woolf’s estimator are of the same form as
for unconditional maximum likelihood. For $\gamma$ we have $\hat{\gamma}_{MW} \pm z_{\alpha/2} W^{-1/2}$,
where $W = \sum_{k} w'_k$. $W^{-1}$ is an estimate of the large-sample variance of $\hat{\gamma}_{MW}$.
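
The Woolf and modified Woolf calculations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `woolf_log_odds_ratio`, the representation of each stratum as a tuple `(x1, n1, x2, n2)`, and the default normal percentage point are choices made for this example, not part of the original papers.

```python
from math import log, sqrt


def woolf_log_odds_ratio(tables, add=0.0, z=1.959964):
    """Woolf estimate of gamma with a normal-theory interval.

    tables: iterable of (x1, n1, x2, n2) tuples, one per stratum.
    add:    constant added to every cell; 0.5 gives the modified Woolf estimator.
    z:      standard normal percentage point (default is roughly z_{0.025}).
    """
    num = 0.0  # running sum of w_k * gamma_k
    den = 0.0  # running sum of w_k, i.e. W
    for x1, n1, x2, n2 in tables:
        a = x1 + add
        b = n1 - x1 + add
        c = x2 + add
        d = n2 - x2 + add
        # With add = 0, a zero cell makes either line below fail, which is
        # the reason the modified estimator was introduced.
        gamma_k = log(a * d / (b * c))
        w_k = 1.0 / (1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
        num += w_k * gamma_k
        den += w_k
    gamma = num / den
    half_width = z / sqrt(den)  # z * W**(-1/2)
    return gamma, (gamma - half_width, gamma + half_width)


tables = [(10, 20, 5, 20), (8, 16, 6, 24)]
gamma_w, interval_w = woolf_log_odds_ratio(tables)             # original Woolf
gamma_mw, interval_mw = woolf_log_odds_ratio(tables, add=0.5)  # modified Woolf
print(gamma_w, interval_w)
print(gamma_mw, interval_mw)
```

Exponentiating the returned limits gives an interval for the odds ratio itself.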

4.3 Conditional Maximum Likelihood 

In the increasing-strata asymptotic case, as mentioned earlier, the usual
maximum likelihood estimator breaks down because there are $K + 1$
parameters to be estimated and $K$ is not fixed. The $K$ parameters corresponding
to the $K$ tables, $\rho_1,\ldots,\rho_K$, are nuisance parameters in this problem in the
sense that $\psi$ is the primary parameter of interest. An estimation approach
that avoids the difficulty due to an increasing number of parameters is
conditional maximum likelihood (CML). The principle of this approach is to
determine the likelihood for the parameter(s) of interest, $\psi$ in this case,
conditional on minimal sufficient statistics for the nuisance parameters and then
to estimate the target parameter by maximizing this conditional likelihood.

For the problem of interest here, $T_1,\ldots,T_K$ are the minimally
sufficient statistics for $\rho_1,\ldots,\rho_K$ and the consequent conditional likelihood is
the product of the extended hypergeometric likelihoods (3.1) from each
table:

\[ L' = \prod_{k=1}^{K} \Pr[X_{1k} \mid T_k, N_{1k}, N_{2k}]. \tag{4.7} \]

Birch (1964) has shown that the value of $\psi$ that maximizes $L'$, $\hat{\psi}_{CML}$, is the
solution to

\[ \sum_{k=1}^{K} X_{1k} = \sum_{k=1}^{K} E(X_{1k} \mid T_k, \psi). \tag{4.8} \]

Miettinen (1970) gives the somewhat simplified form of (4.8) for M:1
matching. Thomas (1975) gives a computer program for finding $\hat{\psi}_{CML}$.
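
A minimal Python sketch of solving (4.8) numerically is given below. It is not the program of Thomas (1975); it simply assumes that bisection on $\gamma = \ln \psi$ is acceptable, since the conditional mean is increasing in $\psi$. The function names, the `(x1, n1, x2, n2)` tuple layout, and the search bracket are hypothetical choices made for the example.

```python
from math import comb, exp, log


def cond_mean(t, n1, n2, gamma):
    """E(X1 | T = t) under the extended hypergeometric (3.1), psi = exp(gamma)."""
    a, b = max(0, t - n2), min(t, n1)
    support = range(a, b + 1)
    # Use log-weights and subtract the maximum to avoid overflow.
    logw = [log(comb(n1, u)) + log(comb(n2, t - u)) + u * gamma for u in support]
    top = max(logw)
    w = [exp(v - top) for v in logw]
    return sum(u * wu for u, wu in zip(support, w)) / sum(w)


def cml_odds_ratio(tables, lo=-20.0, hi=20.0, tol=1e-10):
    """Solve (4.8) for psi by bisection on gamma = ln(psi).

    tables: iterable of (x1, n1, x2, n2) tuples, one per stratum.
    lo, hi: assumed search bracket for gamma.
    """
    tables = list(tables)
    x_plus = sum(x1 for x1, _, _, _ in tables)

    def score(gamma):
        # Left side of (4.8) minus the right side; decreasing in gamma.
        return x_plus - sum(cond_mean(x1 + x2, n1, n2, gamma)
                            for x1, n1, x2, n2 in tables)

    if score(lo) <= 0.0 or score(hi) >= 0.0:
        raise ValueError("no solution inside the search bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return exp(0.5 * (lo + hi))


# Matched pairs: 6 pairs with (X1, X2) = (1, 0) and 3 pairs with (0, 1).
pairs = [(1, 1, 0, 1)] * 6 + [(0, 1, 1, 1)] * 3
print(cml_odds_ratio(pairs))
```

For matched pairs the solution is the ratio of the two discordant counts, which gives a convenient check on the sketch.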

Due to the relative difficulty of solving (4.8), particularly in the days
before high-speed computers were commonly available, a number of
closed-form approximations to $\hat{\psi}_{CML}$ were proposed. [Birch (1964) and Goodman
(1969) are notable examples.] These approximations share the property of
being consistent only at $\psi = 1$. They are therefore of limited usefulness and
are not considered further in this paper.

One approximation worth mention is based on the asymptotic normality
of the extended hypergeometric distribution (Cornfield, 1956; Hannan
and Harkness, 1963). Specifically, as the $N_{ik}$ become large, the conditional
distribution of $X_{1k}$ given $T_k$ becomes normal with mean

\[ E[X_{1k} \mid T_k, \psi] = X_k \]

and variance

\[ V[X_{1k} \mid T_k, \psi] = \left( \frac{1}{X_k} + \frac{1}{T_k - X_k} + \frac{1}{N_{1k} - X_k} + \frac{1}{N_{2k} - T_k + X_k} \right)^{-1}, \]

where $X_k$ is the solution, such that $a_k < X_k < b_k$, to

\[ X_k (N_{2k} - T_k + X_k) / [(T_k - X_k)(N_{1k} - X_k)] = \psi. \]

(Fleiss (1979) gives the closed-form solution to $X_k$ as a function of $\psi$.)
$X_+ = \sum_{k} X_{1k}$ is then asymptotically normal with mean

\[ E[X_+ \mid T_1,\ldots,T_K, \psi] = \sum_{k=1}^{K} E[X_{1k} \mid T_k, \psi] \]

and variance

\[ V[X_+ \mid T_1,\ldots,T_K, \psi] = \sum_{k=1}^{K} V[X_{1k} \mid T_k, \psi]. \]

Gart (1970) proposed replacing the right side of (4.8) by its asymptotic
normal approximation and then finding the solution. Breslow (1976) showed
that the resultant estimator ($\hat{\psi}_{AML}$ in Gart’s notation) is actually the
unconditional maximum likelihood estimator and, hence, that the usual and
conditional maximum likelihood estimators of $\psi$ are asymptotically
equivalent in the fixed-strata asymptotic case (a result that was also noted in the
errata to Gart, 1971). A program for finding $\hat{\psi}_{AML}$, such as that of Thomas
(1975), is then an alternative to direct maximization of the likelihood (4.1)
with respect to the $K + 1$ parameters, as long as the $\rho_k$ are not of interest.

Andersen (1970) determined the properties of conditional maximum
likelihood estimators in the increasing-strata asymptotic case. Applied to
estimation of the odds ratio, Andersen’s results yield

\[ K^{1/2}(\hat{\psi}_{CML} - \psi) \xrightarrow[K \to \infty]{} \mathcal{N}\Big(0,\ \psi^{2} \Big/ \sum_{j=1}^{J} \pi_j \mathrm{Var}_j(X)\Big), \]

where $\mathrm{Var}_j(X)$ denotes the exact variance of the extended hypergeometric
distribution for marginal configuration $j$, and $\pi_j$ is defined at the end of
Section 3. The issue of the asymptotic efficiency of $\hat{\psi}_{CML}$ in the
increasing-strata asymptotic case is unresolved. Andersen shows that the Cramer-Rao
lower bound for the asymptotic variance is not attained by $\hat{\psi}_{CML}$, so there
may be an estimator that is consistent and asymptotically normal with a
lower variance. Lindsay (1983) and Davis (1985) have partly resolved this.
Lindsay’s results show, for the case $N_{1k} = N_1$ and $N_{2k} = N_2$ for all $k$, that
the CML estimator is asymptotically efficient if the $\rho_k$ are from a distribution
with more than $(N_1 + N_2)/2$ points of support. In addition, Davis has shown
that, except for matched pairs, no weighted average of the stratum-specific
odds ratios is as efficient asymptotically as CML in increasing-strata cases.

Once we allow that the $\rho_k$ are from some distribution, we could try
empirical Bayes techniques. Though there has been considerable work on
random effects logistic models, little of this seems applicable to the common
odds ratio problem. (See Stiratelli et al. (1984) for references in this area
and a model applicable if the $\gamma_k$ are also assumed random.) Ejigou and
McHugh (1981) did develop an empirical Bayes estimator for the case of
M:1 matching, but then showed that it was the CML estimator. The
approach of Pierce and Sands (1975) is applicable to the odds ratio problem
but has not been tried (to this author’s knowledge), except for matched
pairs (Pierce, 1976). From Lindsay’s results, we know that the empirical
Bayes approach will not improve on CML asymptotically, but there is the
possibility it may do so in finite samples.

There are two approaches to confidence intervals that are associated with
conditional maximum likelihood. The first is the so-called exact method,
based on the hypergeometric distribution (4.7). Gart (1970) gives the exact
interval for $\psi$ with confidence coefficient at least $1 - \alpha$ as $(\psi_1, \psi_2)$, found as
the solution to

\[ \sum_{J \in R_1} \prod_{k=1}^{K} P[j_k \mid T_k, \psi_1] = \alpha/2 \]

and

\[ \sum_{J \in R_2} \prod_{k=1}^{K} P[j_k \mid T_k, \psi_2] = \alpha/2, \]

where $J = (j_1,\ldots,j_K)$. The set $R_1$ consists of all $J$ such that $\sum_k j_k \ge X_+$;
$R_2$ consists of all $J$ such that $\sum_k j_k \le X_+$. In words, the lower (upper)
limit, $\psi_1$ ($\psi_2$), is the value of $\psi$ such that the probability of the observed or
larger (smaller) value of $X_+$ is $\alpha/2$.

Determination of the exact limits can be very time-consuming,
particularly if the $N_{ik}$ are large. One approximation is to replace the extended
hypergeometric distribution by its asymptotic normal approximation, as
described above. This was done for a single 2 x 2 table by Cornfield (1956)
and extended to multiple 2 x 2 tables by Gart (1970, 1971). Gart’s method
is that $1 - \alpha$ confidence intervals for $\psi$, $(\hat{\psi}_1, \hat{\psi}_2)$, are found as the solutions
to

\[ [X_+ - E(X_+ \mid T_1,\ldots,T_K, \hat{\psi}_2)] / V(X_+ \mid T_1,\ldots,T_K, \hat{\psi}_2)^{1/2} = -z_{\alpha/2} \]

and

\[ [X_+ - E(X_+ \mid T_1,\ldots,T_K, \hat{\psi}_1)] / V(X_+ \mid T_1,\ldots,T_K, \hat{\psi}_1)^{1/2} = +z_{\alpha/2}. \tag{4.9} \]

Thomas’ (1975) program includes the exact and Cornfield-Gart interval
methods. Mantel (1977) has proposed intervals of the form (4.9) except
using the exact hypergeometric mean and variance for $X_+$ and using a
continuity correction.

A second approach based on conditional maximum likelihood would be
to estimate the variance of $\hat{\gamma}_{CML}$ from the inverse of the negative second
derivative matrix of the conditional likelihood in the sample. Such an
approach is directly analogous to that for unconditional maximum likelihood.
Unlike unconditional maximum likelihood, however, most currently
available programs for solving for $\hat{\gamma}_{CML}$ do not involve the second derivatives of
the likelihood and so this approach has not seen application. One computer
program that does compute the inverse second derivatives is that of Breslow
and Day (1980, Appendix V; also Smith et al., 1981), though that program
is for a more general problem.

4.4 Mantel-Haenszel Estimator

Mantel and Haenszel (1959) proposed what is likely the most commonly
used odds ratio estimator, though its properties have been determined
analytically only in recent years. The Mantel-Haenszel estimator is

\[ \hat{\psi}_{MH} = \sum_{k=1}^{K} X_{1k}(N_{2k} - X_{2k})/N_k \Big/ \sum_{k=1}^{K} (N_{1k} - X_{1k}) X_{2k}/N_k
   = \sum_{k=1}^{K} S_k \hat{\psi}_k \Big/ \sum_{k=1}^{K} S_k, \tag{4.10} \]

where

\[ \hat{\psi}_k = X_{1k}(N_{2k} - X_{2k}) / [(N_{1k} - X_{1k}) X_{2k}] \]

and

\[ S_k = (N_{1k} - X_{1k}) X_{2k}/N_k = (1/N_{1k} + 1/N_{2k})^{-1}(1 - p_{1k}) p_{2k}. \tag{4.11} \]

The first form of $\hat{\psi}_{MH}$ given in (4.10) is most convenient for calculation,
while the second is most convenient for some analytic purposes.
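
The two forms of (4.10) can be compared directly in a short Python sketch. The function names and the `(x1, n1, x2, n2)` tuple layout for a stratum are hypothetical choices made for this illustration.

```python
def mantel_haenszel(tables):
    """Mantel-Haenszel odds ratio estimate, first form of (4.10)."""
    r_plus = 0.0
    s_plus = 0.0
    for x1, n1, x2, n2 in tables:
        n_k = n1 + n2
        r_plus += x1 * (n2 - x2) / n_k
        s_plus += (n1 - x1) * x2 / n_k
    return r_plus / s_plus


def mantel_haenszel_weighted(tables):
    """Second form of (4.10): S_k-weighted average of stratum odds ratios.

    This form needs (n1 - x1) * x2 > 0 in every stratum, whereas the first
    form only needs the sum of the S_k to be positive.
    """
    num = 0.0
    den = 0.0
    for x1, n1, x2, n2 in tables:
        s_k = (n1 - x1) * x2 / (n1 + n2)
        psi_k = x1 * (n2 - x2) / ((n1 - x1) * x2)
        num += s_k * psi_k
        den += s_k
    return num / den


tables = [(10, 20, 5, 20), (6, 12, 6, 12)]
print(mantel_haenszel(tables), mantel_haenszel_weighted(tables))
```

The first form remains computable when an individual stratum has a zero cell, which is one practical reason it is preferred for calculation.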

Hauck (1979) obtained the asymptotic distribution of $\hat{\psi}_{MH}$ for the
fixed-strata case (though the derivation for the case of heterogeneous odds ratios
was corrected by Guilbaud, 1983):

\[ N^{1/2}(\hat{\psi}_{MH} - \psi) \xrightarrow[N \to \infty]{} \mathcal{N}\Big(0,\ \psi^{2} \sum_{k=1}^{K} (M_k^{2}/v_k) \Big/ \Big(\sum_{k=1}^{K} M_k\Big)^{2}\Big), \tag{4.12} \]

where

\[ M_k = (1/A_{1k} + 1/A_{2k})^{-1}(1 - P_{1k}) P_{2k}, \]

and $v_k$ is given by (4.3). By applying the Cauchy-Schwarz inequality, $\hat{\psi}_{MH}$
can be seen to be asymptotically inefficient for all values of the odds ratio
except $\psi = 1$ (except for a few degenerate cases; see Tarone et al., 1983).

The asymptotic distribution in the increasing-strata case was found by
Breslow (1981):

\[ K^{1/2}(\hat{\psi}_{MH} - \psi) \xrightarrow[K \to \infty]{} \mathcal{N}\Big(0,\ \sum_{j=1}^{J} \pi_j \mathrm{Var}_j(R - \psi S) \Big/ \Big[\sum_{j=1}^{J} \pi_j E_j(S)\Big]^{2}\Big), \tag{4.13} \]

where $R = X_1(N_2 - X_2)/N$, $S = (N_1 - X_1)X_2/N$, and $E_j$ and $\mathrm{Var}_j$ denote
the mean and variance given the $j$th marginal configuration. When $\psi = 1$,
the asymptotic variance of $\hat{\psi}_{MH}$ is equal to that of $\hat{\psi}_{CML}$, analogous to the
relation to $\hat{\psi}_{ML}$ in the fixed-strata asymptotic case.

The basic form for confidence intervals based on the Mantel-Haenszel
estimator is $\hat{\gamma}_{MH} \pm z_{\alpha/2} \hat{V}^{1/2}$, where $\hat{V}$ is an estimate of the variance of $\hat{\gamma}_{MH}$.

The presence of two different asymptotic variance formulas for the
Mantel-Haenszel estimator complicates the problem of choosing an estimator of the
variance. There have been a number of proposals, each of which could be used
for confidence intervals; see Table 2. If we were to set criteria that an ideal
$\hat{V}$ would satisfy, we would ask that $\hat{V}$ be appropriate in both asymptotic
cases and that $\hat{V}$ not be appreciably more difficult to calculate than $\hat{\psi}_{MH}$
itself. Of the nine variance estimators listed in Table 2, only $\hat{V}_U$ and $\hat{V}_{US}$
proposed by Robins et al. (1986) satisfy both these criteria. Of these two,
$\hat{V}_{US}$ is the variance estimator of choice, since it, but not $\hat{V}_U$, is invariant
under interchange of rows.

Robins et al. (1986) worked with the variance in (4.13), which is
actually applicable in both asymptotic cases (Breslow and Liang, 1982), and
found estimators of the numerator and denominator that were consistent
in both asymptotic cases. The first proposal, $\hat{V}_C$, is to evaluate the
hypergeometric means and variances in (4.13) at $\psi = \hat{\psi}_{MH}$. $\hat{V}_C$ is appropriate
in both asymptotic cases, but is computationally burdensome. In the
second estimator of Robins et al., $\hat{V}_U$, the numerator and denominator of $\hat{V}_U$
are consistent estimators, in both asymptotic cases, of the numerator and
denominator of the variance (4.13). However, $\hat{V}_U$ is not invariant under
interchange of rows or columns. For a symmetrized version, $\hat{V}_{US}$, they
considered the arithmetic average of $\hat{V}_U$ with $\hat{V}_U$ recomputed after interchange
of rows. (Breslow and Liang had similarly proposed symmetrizing by
taking the geometric mean of $\hat{V}_H$ with $\hat{V}_H$ recomputed after interchange of
rows.) The resulting estimator is

\[ \hat{V}_{US}(\hat{\gamma}_{MH}) = \frac{\sum_k F_k R_k}{2 R_+^{2}} + \frac{\sum_k (F_k S_k + G_k R_k)}{2 R_+ S_+} + \frac{\sum_k G_k S_k}{2 S_+^{2}}, \]

where $F_k = (X_{1k} + N_{2k} - X_{2k})/N_k$, $G_k = (N_{1k} - X_{1k} + X_{2k})/N_k$,
$R_+ = \sum_k R_k$, and $S_+ = \sum_k S_k$.
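
The variance formula above is simple enough to compute alongside the point estimate. The following Python sketch, with a hypothetical function name and an assumed `(x1, n1, x2, n2)` tuple per stratum, returns the Mantel-Haenszel estimate, the symmetrized variance estimate for its logarithm, and the resulting interval for the odds ratio.

```python
from math import exp, log, sqrt


def mh_with_interval(tables, z=1.959964):
    """Mantel-Haenszel estimate with the symmetrized variance of Robins et al.

    tables: iterable of (x1, n1, x2, n2) tuples, one per stratum.
    z:      standard normal percentage point (default is roughly z_{0.025}).
    Returns (psi_mh, v_us, (lower, upper)), the limits being for psi.
    """
    r_plus = s_plus = 0.0
    sum_fr = sum_fs_gr = sum_gs = 0.0
    for x1, n1, x2, n2 in tables:
        n_k = n1 + n2
        r_k = x1 * (n2 - x2) / n_k
        s_k = (n1 - x1) * x2 / n_k
        f_k = (x1 + n2 - x2) / n_k
        g_k = (n1 - x1 + x2) / n_k
        r_plus += r_k
        s_plus += s_k
        sum_fr += f_k * r_k
        sum_fs_gr += f_k * s_k + g_k * r_k
        sum_gs += g_k * s_k
    gamma = log(r_plus / s_plus)
    v_us = (sum_fr / (2.0 * r_plus ** 2)
            + sum_fs_gr / (2.0 * r_plus * s_plus)
            + sum_gs / (2.0 * s_plus ** 2))
    half_width = z * sqrt(v_us)
    return exp(gamma), v_us, (exp(gamma - half_width), exp(gamma + half_width))


print(mh_with_interval([(10, 20, 5, 20), (6, 12, 6, 12)]))
```

For a single table the expression reduces algebraically to the sum of reciprocal cell counts in (4.5), which provides a quick check of an implementation.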

Robins et al. also reported a simulation study that compared $\hat{V}_U$, $\hat{V}_{US}$,
$\hat{V}_C$, $\hat{V}_B$, $\hat{V}_{BL}$, $\hat{V}_H$, and $\hat{V}_G$ in terms of agreement with the actual variance and
validity of confidence intervals for situations corresponding to both
asymptotic cases. As would be expected from the above discussion, only $\hat{V}_U$, $\hat{V}_{US}$,
and $\hat{V}_C$ consistently gave confidence interval coverage rates close to nominal.
They recommend use of one of these three variance estimators for all types of
data. By the criteria set earlier, $\hat{V}_{US}$ is the method of choice. An important
caveat is that there is no reason to expect any of these variance estimators
to perform well in the case of small total sample size, i.e., a small number of
tables each with small sample sizes.


Table 2. Estimators of the Variance of the Mantel-Haenszel Estimator

Estimator       Principal Reference      Comments

A. Valid in Fixed-Strata Case Only

$\hat{V}_H$     Hauck (1979)             An overestimator (Robins et al.,
                                         1986); not invariant under interchange
                                         of rows or columns (Ury, 1982).

$\hat{V}'_H$    Breslow and Liang        Modification to $\hat{V}_H$ that is less of an
                (1982)                   overestimator (Robins et al., 1986);
                                         not invariant under interchange of
                                         rows or columns.

$\hat{V}_G$     Gart, reported by        Modification to $\hat{V}_H$ using estimates
                Ury (1982)               of the $P_{ik}$ consistent with the
                                         estimated odds ratio and thus is
                                         invariant under interchange of rows or
                                         columns [analogous to the relation
                                         between (4.4) and (4.5)].

B. Valid in Increasing-Strata Case Only

$\hat{V}_B$     Breslow (1981)           An underestimator (Robins et al.,
                                         1986); simplified form for matched
                                         studies given by Connett et al. (1982)
                                         and Fleiss (1984).

C. Valid in Both Asymptotic Cases

$\hat{V}_{BL}$  Breslow and Liang        Weighted combination of $\hat{V}_H$ and
                (1982)                   $\hat{V}_B$; not invariant under interchange
                                         of rows or columns.

$\hat{V}_C$     Robins et al. (1986)     Evaluate hypergeometric means and
                                         variances in (4.13) at $\psi = \hat{\psi}_{MH}$.

$\hat{V}_U$     Robins et al. (1986)     Based on consistent estimation of
                                         the numerator and denominator of
                                         (4.13); not invariant under
                                         interchange of rows and columns.

$\hat{V}_{US}$  Robins et al. (1986)     Symmetrized form of $\hat{V}_U$.

In addition to the choices for $\hat{V}$ listed in Table 2, confidence intervals
can be obtained from the test-based method proposed by Miettinen (1976).
Applied in this context, Miettinen’s proposal is to determine a variance, $\hat{V}_M$,
from

\[ \chi^{2} = \hat{\gamma}^{2} / \hat{V}_M, \]

where $\chi^{2}$ is a one degree-of-freedom chi-square statistic for testing the null
hypothesis $\gamma = 0$. The choice of $\chi^{2}$ is left open. As one example, Hauck
and Wallemark (1983), in their comparative study, used the test proposed
by Mantel and Haenszel (1959), but without the continuity correction.

It should be noted that the test-based principle is not valid
(Greenland, 1984) and so problems are to be expected. One problem, discussed
by Halperin (1977) (with rejoinder by Miettinen, 1977), is that the method
requires that the variance of the point estimator not depend on the value
of the parameter. For estimation of the odds ratio, this can be, at best, an
approximation near the null value ($\gamma = 0$). We thus have a theoretical basis
for expecting that the test-based intervals may not do well for $\gamma$ away from
zero. Regardless of these considerations, the test-based method is mentioned
here because it has received consideration as a confidence interval method
for odds ratios from single 2 x 2 tables (Brown, 1981; Gart and Thomas,
1982), and because it has been used for odds ratios from multiple tables
(Centers for Disease Control, 1983, for example).

4.5 Modifications to the Mantel-Haenszel, Maximum Likelihood, and Woolf 
Estimators 

Hauck et al. (1982) proposed modifying the Mantel-Haenszel and
(unconditional) maximum likelihood estimators in a manner analogous to Gart’s
(1966) modification to the Woolf estimator. The modified Woolf estimator
can be viewed as applying Woolf’s original formula to tables that had 0.5
added to each cell. Hauck et al.’s proposal was to apply the Mantel-Haenszel
and maximum likelihood formulas to such modified tables, after adding some
suitable constant to each cell.

This proposal had two motivations. First, the modified Woolf estimator
had been shown to have a smaller variance than the Mantel-Haenszel and
maximum likelihood estimators (McKinlay, 1975; Hauck et al., 1980) and
it was conjectured that at least some of this increase in precision was due
to the larger effective sample size for the Woolf estimator ($N + 2K$ instead
of $N$). Second, the preliminary study by Hauck et al. found that
Mantel-Haenszel and maximum likelihood tended to be overestimators (for $\psi > 1$).
Since the modification would tend to bring the estimates of $\psi$ closer to 1,
it was possible that modification might improve the bias properties of the
estimators while also decreasing their variance.

Hauck et al. (1980) also suggested that, in terms of bias, a 0.5 correction 
was too great for the Woolf estimator. Hitchcock (1962) had suggested 
0.25 as a correction in a related situation, so 0.25 and 0.5 corrections were 
considered for the Mantel-Haenszel and maximum likelihood estimators and 
a 0.25 correction for the Woolf estimator. Additional related modifications 
to the Woolf estimator have been considered by Wells (1982). All these 
estimators are asymptotically equivalent, in the fixed-strata case, to the 
corresponding original estimator. 

Pseudocount methods (Fienberg and Holland, 1970) would let the data
determine the appropriate constant to add to each cell, instead of fixing
the constant at 0.5 or 0.25. The rationale is that the more the data were
consistent with ψ near 1.0, the larger the modification would be, thus pulling
the estimators in towards 1.0. Hauck et al. (1982) considered two such
variants of the Mantel-Haenszel, unconditional maximum likelihood, and
Woolf estimators and found that, away from ψ = 1, these pseudocount
variants became much more biased than the other modifications. These two
variants will not be considered further.

An alternative modification to the Mantel-Haenszel estimator, based on 
the jackknife principle, was proposed by Breslow and Liang (1982). Letting
$\hat{\gamma}_{MH}^{(-k)}$ denote the Mantel-Haenszel estimator of γ after deleting the kth table,
the jackknife estimator proposed by Breslow and Liang is

$$\hat{\gamma}_{JK} = K\hat{\gamma}_{MH} - \frac{K-1}{K}\sum_{k=1}^{K}\hat{\gamma}_{MH}^{(-k)},$$

and $\hat{\psi}_{JK} = \exp(\hat{\gamma}_{JK})$.
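
A minimal sketch of a delete-one-table jackknife for the log Mantel-Haenszel estimator follows. It assumes the standard jackknife form (K times the full estimate minus (K - 1) times the mean of the leave-one-out estimates); the table layout (x1, y1, x2, y2), the function names and the counts are hypothetical.

```python
import math

def log_mh(tables):
    """Log of the Mantel-Haenszel odds ratio; each table is (x1, y1, x2, y2)."""
    num = sum(x1 * y2 / (x1 + y1 + x2 + y2) for x1, y1, x2, y2 in tables)
    den = sum(x2 * y1 / (x1 + y1 + x2 + y2) for x1, y1, x2, y2 in tables)
    return math.log(num / den)

def jackknife_log_mh(tables):
    """Delete-one-table jackknife of the log Mantel-Haenszel estimator."""
    K = len(tables)
    full = log_mh(tables)
    # Re-estimate with each table removed in turn.
    leave_one_out = [log_mh(tables[:k] + tables[k + 1:]) for k in range(K)]
    return K * full - (K - 1) * sum(leave_one_out) / K

# Hypothetical data.
tables = [(10, 20, 5, 25), (8, 12, 6, 14), (15, 15, 9, 21), (7, 23, 4, 26)]
gamma_jk = jackknife_log_mh(tables)
psi_jk = math.exp(gamma_jk)
print(round(gamma_jk, 4), round(psi_jk, 4))
```

Because every leave-one-out estimate needs at least one remaining table with nonzero off-diagonal products, the sketch presupposes K of at least 2 and no degenerate tables.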

Jewell (1984) also considered jackknifing as one of many possibilities he 
considered for reducing the bias of the Mantel-Haenszel estimator and its 
logarithm in matched samples and of the conditional maximum likelihood 
estimator for matched pairs. Jewell’s basic approach was to eliminate the 
$K^{-1}$ term in the bias. For example, for matched pairs, direct adjustment
for this term yields

$$\hat{\psi}_{ADJ} = 2R_+/(2S_+ + 1)$$

and

$$\hat{\gamma}_{ADJ} = \ln\{R_+(2R_+ + 1)/[S_+(2S_+ + 1)]\},$$

compared to $\hat{\psi}_{CML} = \hat{\psi}_{MH} = R_+/S_+$.


5. FINITE-SAMPLE COMPARISONS
5.1 Point Estimators 

In recent years, a number of simulation studies have been conducted 
for the purpose of determining and then comparing the properties (bias 
and precision) of the various odds ratio point estimators (Table 3). The 
first of these studies was done by McKinlay (1975), who considered cases 
corresponding to the fixed-strata asymptotic case. McKinlay found that 
the Woolf estimator was consistently the most precise, but that the bias 
of the Woolf estimator increased rapidly with increased number of strata 
for a fixed sample size. Of the three estimators she considered, only the 
Mantel-Haenszel estimator consistently removed the bias in all the cases she 
considered. 


Table 3. Finite-Sample Studies of the Point Estimators of ψ and γ

                       Relevant
Reference              Asymptotic Case     Estimators Compared*

McKinlay (1975)        Fixed-Strata        Birch, MH, MW
McKinlay (1978)        Fixed-Strata        Birch, Goodman, MH, MW, UML
Lubin (1981)           Increasing-Strata   CML, UML
Hauck et al. (1982)    Fixed-Strata        MH, MW, UML (+4 variants of each)
Jewell (1984)          Matched Pairs       CML (=MH) (+4 variants)
Hauck (1984)           Fixed-Strata        CML, MH, UML
Liang (1984)           Both                CML, MH, MW, UML

*CML = conditional maximum likelihood; MH = Mantel-Haenszel;
MW = modified Woolf; UML = unconditional maximum likelihood.


In her second simulation study, McKinlay (1978) was interested primarily
in the effect of heterogeneity of the strata-specific odds ratios (the ψ_k)
on the properties of the estimators of the log odds ratio, but some results
were for the homogeneous case of interest in this review. As in McKinlay
(1975), she found the modified Woolf estimator to be usually most precise
but with marked increases in bias for larger numbers of strata. The
Mantel-Haenszel and maximum likelihood estimators were either similar, or
the Mantel-Haenszel estimator was less biased and more precise.

As noted by Hauck et al. (1982), when considered as a comparison of
odds ratio estimators from stratified samples (rather than as a comparison of 
matching to stratification as was McKinlay’s primary intention), McKinlay’s 
(1975) work did not provide sufficient information to choose a stratified 
estimator—she did not consider either of the maximum likelihood estimators. 
Also, one could not determine to what extent her results were due to the 
approximation inherent in stratifying a continuous variable. In McKinlay 
(1978), the (unconditional) maximum likelihood estimator was considered, 
but the approximation inherent in stratifying a numerical variable remained. 

The studies of Hauck et al. (1980) and Hauck et al. (1982) were
intended to correct for these problems for cases corresponding to fixed-strata
asymptotics and, in the 1982 paper, to study the modifications introduced
in Section 4.5. Results for the Woolf estimator with both the 0.25 and
0.5 modifications confirmed McKinlay’s work: it was more precise than the
corresponding modification to the Mantel-Haenszel and unconditional
maximum likelihood estimators, but more biased, sometimes considerably so.
Comparing the Mantel-Haenszel and maximum likelihood estimators, the
Mantel-Haenszel estimator was the more precise (when compared to the
maximum likelihood estimator with the same modification), as McKinlay
(1978) had found. The choice of estimator depended on whether one wished
to estimate ψ or γ. The original Mantel-Haenszel and maximum likelihood
estimators overestimated ψ by an amount that increased with ψ (ψ > 1).
Use of the 0.25 correction for Mantel-Haenszel and the 0.5 correction for
maximum likelihood considerably reduced the overestimation. For estimating
γ, the original Mantel-Haenszel estimator and the maximum likelihood
estimator with the 0.25 correction had the best bias properties.

The omission by Hauck et al. (1982) of the conditional maximum likelihood
estimator was rectified by Hauck (1984), who compared the conditional
maximum likelihood estimator to its two common competitors, the
Mantel-Haenszel and unconditional maximum likelihood estimators, for fixed-strata
cases. (Similar comparisons were also included by Liang (1984), who also
considered the modified Woolf estimator and found similar results.) Of
these three estimators, the conditional maximum likelihood is superior in
terms of both bias and precision. The properties of the Mantel-Haenszel
estimator were similar; it was less biased as an estimator of ψ than
conditional maximum likelihood in 11 of the 36 cases considered by Hauck, but
in all 11 of these cases, both estimators were essentially unbiased (absolute
bias < 0.012). Otherwise conditional maximum likelihood was equal
or slightly superior for both γ and ψ. Unconditional maximum likelihood
was uniformly inferior. However, compared to the modified versions of the
Mantel-Haenszel and unconditional maximum likelihood estimators, the
situation is not so clear. For estimating ψ, the 0.25-modified Mantel-Haenszel
and 0.5-modified unconditional maximum likelihood estimators were
superior in terms of both bias and precision for the cases considered (minimum of
20% decrease in average absolute bias and 10% decrease in average variance,
with the exception of one case of the 12 where the average variances were
essentially equal). For estimating γ, the conditional maximum likelihood
estimator is slightly superior to the best of the other estimators
(Mantel-Haenszel and 0.25-modified maximum likelihood) in terms of bias and
comparable to the Mantel-Haenszel estimator in precision.

Maximum likelihood estimation in situations corresponding to the 
increasing-strata case (small numbers per table) was considered by Lubin 
(1981) and Liang (1984). Lubin compared conditional to unconditional
maximum likelihood, and Liang compared those two methods and the
Mantel-Haenszel and Woolf estimators for 1:m matching and other cases of small
numbers per table. Both found that unconditional maximum likelihood was
a consistent overestimator of γ (γ > 0), as also noted by Jewell (1984) for
matched samples. Conditional maximum likelihood, while still a consistent 
overestimator, had a much smaller bias (estimated to be up to 1/50 that 
of the unconditional maximum likelihood estimator in one of Lubin’s cases) 
and was more precise. Liang found that the Mantel-Haenszel estimator was 
only slightly inferior to CML, as in the fixed-strata cases. 

For matched pairs, Jewell (1984) found that the jackknife was less
effective at bias reduction than methods that adjusted the Mantel-Haenszel
estimator for the $K^{-1}$ term in the expansion for the bias. Jewell prefers
$\hat{\psi}_{ADJ}$ for estimating ψ and different choices for estimating γ depending on
the underlying parameter values.

5.2 Interval Estimators 

There has been considerably less attention given to evaluation of the 
properties of the various odds ratio interval estimators. Methods for a single 
2 x 2 table were compared by Gart and Thomas (1972, 1982), Fleiss (1979), 
and Brown (1981). Seven methods for multiple tables have been considered 
by Hauck and Wallemark (1983). However, the results on the three Mantel- 
Haenszel methods have been superseded by the work of Robins et al. (1986), 
who compared six Mantel-Haenszel variances in confidence intervals. Both 
Hauck and Wallemark, and Robins et al. considered both asymptotic cases. 

In the fixed strata cases, we can consider the six methods considered 
by Hauck and Wallemark as falling into four groups. Two methods, namely 
Gart’s asymptotic approximation to the exact and the Mantel-Haenszel with 
Hauck’s variance formula were conservative (that is, coverage probabilities 
greater than nominal) for all investigated combinations. The result for 
Hauck’s formula agrees with Robins et al.’s finding that $V_H$ overestimates
the variance of $\hat{\gamma}_{MH}$. The asymptotic exact method is more conservative in
multiple 2x2 tables than for a single table (Gart and Thomas, 1982).

Next is a method whose coverage probability falls below the nominal 
level for all investigated combinations, namely the Mantel-Haenszel method 
with Breslow’s variance formula. This variance formula is intended for large 
numbers of tables and would not be expected to do well in the large sample 
cases. Again, the results here agree with those of Robins et al., who found
$V_B$ to be an underestimator of the variance of $\hat{\gamma}_{MH}$.

The third group consists of two methods that were inconsistent in the
sense that their coverage probabilities were not consistently above or
consistently below nominal. Coverage probabilities for the Woolf method ranged
from the most conservative to the most liberal of all methods. Hauck et al.
(1982) had found that the Woolf point estimator underestimates the
magnitude of γ for γ ≠ 0, and that this bias becomes larger as γ increases. This
bias is reflected in the confidence intervals; where the coverage probability
falls below nominal, ψ tends to lie above the intervals. For example, for
K = 10, N = 500, ψ = 5.0, the estimated coverage probability for nominal
95% intervals was .897, and the estimated probability of the interval
lying below ψ was .102. Gart and Thomas (1982) found a similar bias for a
single table, though of small magnitude. It appears that all cell frequencies
must be large for this method to be reasonable.

The second inconsistent method is the Mantel-Haenszel estimator with
standard error determined by Miettinen’s test-based method. For values of
ψ at or near 1, its coverage probabilities are near nominal. However, as ψ
increases the estimated coverage probability eventually falls below nominal.
For a single table, Brown (1981) and Gart and Thomas (1982) found similar
results and, by considering values of ψ larger than those considered by Hauck
and Wallemark, obtained coverage probabilities much further below nominal
than those reported by Hauck and Wallemark.

Lastly, Hauck and Wallemark reported a method that performed
consistently well in the large-sample cases: the Mantel-Haenszel estimator with
Breslow and Liang’s compromise variance formula. Its estimated coverage
probability was never far from nominal. In Robins et al.’s study, the
coverage probabilities using $V_{US}$ were generally slightly closer to nominal than
those using $V_{BL}$.

The seven methods considered by Hauck and Wallemark for the matched
sample cases fell into three groups. Three methods, the exact and two
Mantel-Haenszel methods, with Hauck’s and BL’s compromise variance
formulas, were consistently conservative. The compromise formula was the
least conservative of the three, and, in general, Hauck’s formula was the
most conservative of the Mantel-Haenszel methods. As in the large-sample
cases, the confidence interval results agreed with Robins et al.’s results that
$V_H$ tended to overestimate the variance of $\hat{\gamma}_{MH}$. Occasionally, though, the
coverage probability was slightly lower for $V_H$ than for $V_B$ or $V_C$. In contrast
to the fixed-strata cases, Robins et al. found that the standard deviation
of $V_H$ was much larger than that of $V_B$ in matched sample cases so that
$V_H < V_B$ is possible.

Three methods were inconsistent in the same sense as above. The
inappropriateness of the Woolf method for matched samples was confirmed by
Hauck and Wallemark’s simulation; it was erratic to say the least, with
coverage probabilities for the nominal 95% intervals ranging from 7% to 99.9%.
The test-based and asymptotic exact methods were also inconsistent. Both
were conservative for ψ at or near one. For larger ψ, the estimated coverage
probabilities dropped well below nominal, particularly for the asymptotic
approximation to the exact method. The test-based method behaved as
in the fixed-strata cases in terms of coverage probabilities; it has coverage
probabilities near nominal for ψ near one, but is unreliable away from one.
Unlike the large-sample cases, however, there is a caveat to its behavior near
one. The average interval length and, particularly, the standard deviation of
the interval length are inflated when ψ = 1, relative to the other reasonable
methods. A value of chi-square near zero must occasionally grossly inflate
$V_M$ and hence the interval length.

The method that was most consistently near nominal of those
considered by Hauck and Wallemark was the Mantel-Haenszel estimator with
Breslow’s variance formula. Of the seven methods they considered, this was the
method of choice. The only caution is that the estimated coverage
probabilities tended to drop slightly below nominal for some of the 1:8 matching
cases. Robins et al. also found coverage probabilities below nominal for
tables with N_1 = N_2 = 5. As for the fixed-strata cases, confidence intervals
using the variances proposed by Robins et al. had coverage probabilities
consistently near nominal.


6. CONCLUSIONS 

The primary motivation for considering and reviewing the various odds 
ratio estimators has been the desire to reach a reasonable recommendation as 
to which estimator to use in the different situations that can be encountered. 
If, to begin with, we leave aside consideration of the modifications presented
in Section 4.5, which have been proposed recently, the choice of estimator is
reasonably clear. The Mantel-Haenszel and conditional maximum likelihood
estimators share the property of being robust in the sense that they behave
reasonably in both the fixed-strata and increasing-strata asymptotic cases.
The Woolf and unconditional maximum likelihood estimators, on the other
hand, are badly biased in the increasing-strata case. These asymptotic
considerations have been found to apply also in various finite-sample studies. It
thus seems reasonable to restrict consideration to the Mantel-Haenszel and
CML estimators.

Considerations of asymptotic efficiency and minimum bias in finite
samples give an advantage to conditional maximum likelihood over the
Mantel-Haenszel estimator. Conversely, there are three advantages to the
Mantel-Haenszel method. First, it is considerably simpler, since it is noniterative,
with bias and variance properties close to conditional maximum likelihood.
Second, with the work of Robins et al. (1986), we have an associated
confidence interval procedure that is also noniterative and applicable in both
asymptotic cases. Third, Donald (1984) for the fixed-strata case and Liang
(1985) for the increasing-strata case have shown that the Mantel-Haenszel
estimator remains consistent and asymptotically normal when the binomial
assumption is violated and observations within a stratum are allowed to
be dependent. In the fixed-strata case, Donald showed that unconditional
maximum likelihood remains consistent, but Liang showed that the CML
estimator does not remain consistent in the increasing-strata case. In
summary, the Mantel-Haenszel estimator is a very reasonable choice.

Since all the estimators considered here are biased in finite samples, there 
is the opportunity to modify the estimators to simultaneously improve (or 
at least not worsen) their bias and precision properties. Modifications of 
the type proposed by Hauck et al. (1982), analogous to the modified Woolf 
estimator, appear reasonable in the case of moderate numbers in each table, 
but their motivation in the increasing-strata case is lacking. The jackknifed 
Mantel-Haenszel estimator proposed by Breslow and Liang (1982) and also 
considered by Jewell (1984) also appears reasonable, but here, since the 
estimator is defined by deletion of tables, the motivation in the case of a 
small number of tables is lacking. Breslow and Liang note, with respect 
to the jackknifed variance, further work is needed to determine the proper 
unit for deletion and the optimum weights for combining the pseudovalues. 
Other possibilities are the bias-adjusted estimators developed by Jewell. 

“Further work is needed” is actually a good closing. While the conditional
maximum likelihood and Mantel-Haenszel estimators are clearly the point
estimators of choice among estimators that have been reasonably well studied to
date, it is also clear that improvement of finite-sample properties is possible.


THE EFFECT OF CLUSTERING ON THE ANALYSIS
OF SETS OF 2 x 2 CONTINGENCY TABLES 

ABSTRACT 

The usual methods of analyzing data in sets of 2 x 2 contingency tables
are invalid when observations are correlated. This happens when, for
example, families rather than individuals are assigned to treatments. This paper
compares two models of clustering in dichotomous data: the beta-binomial
model and a model developed by Cohen and Altham. The beta-binomial
distribution is then used to model responses in K 2 x 2 contingency tables.
This leads to an expression for the variance of the maximum likelihood
estimator of the common odds ratio, which is compared to that of the usual
MLE of the odds ratio. The performance of the Mantel-Haenszel estimator of the
odds ratio under cluster sampling is assessed using a simulation study.

1. INTRODUCTION 

Two problems that often complicate the analysis of dichotomous data in 
comparative studies are the necessity for stratification and the occurrence 
of dependent responses. Each of these problems has been solved to some 
extent. The stratification of fourfold tables may be dealt with using the 
familiar Mantel-Haenszel procedures, which have been extensively studied 
in the past 25 years. The phenomenon of dependent responses—which we 
call the clustering problem—has recently begun to receive more attention. 
Cohen (1976) and Altham (1976) suggested a heuristic model for clustering 
in categorical data; Brier (1980) used the Dirichlet multinomial distribution. 
This paper explores what occurs when the two problems coincide. Detailed 
proofs, which are omitted, are given by Donald (1985) and Donald and 
Donner (1985). 

1 Department of Epidemiology and Biostatistics, The University of Western
Ontario, London, Ontario N6A 5C1

2. REVIEW


2.1 The Stratification Problem 

Consider a set of K 2x2 contingency tables, each consisting of 2 binomial
distributions. The underlying probabilities of the tth table may be written
as in Table 1 and the responses as in Table 2. We assume that the odds
ratio, ψ, is common to all tables. That is,

$$\psi = \frac{p_{1t}q_{2t}}{p_{2t}q_{1t}}. \qquad (1)$$


Table 1. Underlying Probabilities of Success in the
tth of K 2x2 Contingency Tables. q_jt = 1 − p_jt.

                 Response
Group     Positive   Negative   Sum

1         p_1t       q_1t       1
2         p_2t       q_2t       1


Mantel and Haenszel (1959) suggested the following estimate of ψ:

$$\hat{\psi}_{MH} = \frac{\sum_t x_{1t}y_{2t}/n_t}{\sum_t x_{2t}y_{1t}/n_t} \qquad (2)$$

and the test of $H_0: \psi = 1$ by the statistic

$$\chi^2_{MH} = \frac{\left\{\sum_t [x_{1t} - E(x_{1t})]\right\}^2}{\sum_t \mathrm{var}(x_{1t})}, \qquad (3)$$

which is approximately chi-square with one degree of freedom. Hauck (1979)
showed that if K is fixed and the number of observations per table increases,
then the asymptotic variance of $\hat{\psi}_{MH}$ is

$$\mathrm{var}(\hat{\psi}_{MH}) = \frac{\psi^2 \sum_t w_t^2 v_t}{\left(\sum_t w_t\right)^2}, \qquad (4)$$

where

$$w_t = \frac{p_{2t}q_{1t}n_{1t}n_{2t}}{n_t}$$

and

$$v_t = \frac{1}{n_{1t}p_{1t}q_{1t}} + \frac{1}{n_{2t}p_{2t}q_{2t}}.$$

The Mantel-Haenszel procedures, because they employ closed-form rather than
iterative estimators, are common in epidemiological research.
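
The Mantel-Haenszel estimate, the test of ψ = 1 and a plug-in version of Hauck's variance can be computed directly from the counts of Table 2. The sketch below is illustrative only: it assumes the usual hypergeometric null mean and variance for x_1t in the test statistic, replaces the unknown probabilities in Hauck's formula by the sample proportions, and uses hypothetical function names and data.

```python
def mh_estimate_and_test(tables):
    """Return (psi_hat, chi_square) for tables given as (x1, y1, x2, y2)."""
    num = den = dev = var0 = 0.0
    for x1, y1, x2, y2 in tables:
        n1, n2 = x1 + y1, x2 + y2          # row totals
        xt, yt = x1 + x2, y1 + y2          # column totals
        n = n1 + n2
        num += x1 * y2 / n
        den += x2 * y1 / n
        dev += x1 - n1 * xt / n            # x_1t minus its null expectation
        var0 += n1 * n2 * xt * yt / (n * n * (n - 1))  # hypergeometric variance
    return num / den, dev * dev / var0

def hauck_variance(tables, psi):
    """Plug-in large-sample variance of the Mantel-Haenszel estimator.

    Assumes no empty cells, since sample proportions replace p_jt.
    """
    sum_w = sum_w2v = 0.0
    for x1, y1, x2, y2 in tables:
        n1, n2 = x1 + y1, x2 + y2
        p1, p2 = x1 / n1, x2 / n2
        q1, q2 = 1 - p1, 1 - p2
        w = p2 * q1 * n1 * n2 / (n1 + n2)
        v = 1 / (n1 * p1 * q1) + 1 / (n2 * p2 * q2)
        sum_w += w
        sum_w2v += w * w * v
    return psi ** 2 * sum_w2v / sum_w ** 2

# Hypothetical data.
tables = [(10, 20, 5, 25), (8, 12, 6, 14), (15, 15, 9, 21)]
psi_hat, chi2 = mh_estimate_and_test(tables)
print(round(psi_hat, 4), round(chi2, 3), round(hauck_variance(tables, psi_hat), 4))
```

For a single table the plug-in variance reduces to the squared odds ratio times the sum of reciprocal cell counts, which gives a quick check on the implementation.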


Table 2. Data Array in the tth of K 2x2 Contingency Tables.

                 Response
Group     Positive   Negative   Sum

1         x_1t       y_1t       n_1t
2         x_2t       y_2t       n_2t
Sum       x_t        y_t        n_t


Maximum likelihood estimation, developed by Gart (1962), requires
iterative techniques. The variance of the maximum likelihood estimator,
$\hat{\psi}_{ML}$, is

$$\mathrm{var}(\hat{\psi}_{ML}) = \psi^2 \Big/ \sum_t \frac{n_{1t}n_{2t}p_{1t}q_{1t}p_{2t}q_{2t}}{n_{1t}p_{1t}q_{1t} + n_{2t}p_{2t}q_{2t}}. \qquad (5)$$
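
As a numerical illustration of the variance of the maximum likelihood estimator, the sketch below evaluates it for given design values. The second-row probability is obtained from the first-row probability and the common odds ratio by solving the common odds ratio relation; the function names and inputs are assumptions made for this example.

```python
def p2_from_odds_ratio(p1, psi):
    """Solve psi = p1*q2 / (p2*q1) for p2."""
    return p1 / (p1 + psi * (1 - p1))

def ml_variance(psi, strata):
    """Large-sample variance of the ML odds ratio estimator.

    strata is a list of (n1, n2, p1): row sample sizes and the first-row
    success probability in each table (hypothetical input format).
    """
    total = 0.0
    for n1, n2, p1 in strata:
        p2 = p2_from_odds_ratio(p1, psi)
        a = n1 * p1 * (1 - p1)
        b = n2 * p2 * (1 - p2)
        total += a * b / (a + b)
    return psi ** 2 / total

# Hypothetical design: five tables, 24 observations per row, psi = 2.
strata = [(24, 24, p) for p in (0.10, 0.30, 0.50, 0.70, 0.90)]
print(round(ml_variance(2.0, strata), 4))
```

With a single table this expression coincides with the squared odds ratio times the sum of the reciprocal binomial variances of the two rows.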


2.2 The Clustering Problem 

Correlated responses in categorical data manifest themselves in the
tendency of clusters to respond in similar ways. In an independent binomial
sample of 100 with an underlying probability parameter of .1, for example, 
the expected 10 positive responses will be distributed randomly throughout 
the sample. If, however, the sample consists of 20 clusters of size five with 
high correlation within clusters, it is more likely that the positive responses 
will be concentrated in one or two clusters and that the remaining clusters 
will display only negative responses.

Cohen (1976) and Altham (1976) suggested the following model for this
phenomenon for multinomial distributions with r response categories in
clusters of size m. Let p_i be the probability of a randomly selected subject
displaying response i (i = 1, ..., r) and let p_ij be the probability that the
first member of a cluster is in category i and the second member is
in category j. Then

$$p_{ij} = \begin{cases} a p_i + (1 - a)p_i^2 & \text{if } i = j \\ (1 - a)p_i p_j & \text{if } i \neq j, \end{cases}$$

where 0 < a < 1. Under this model, the density function for a cluster of size
m with a dichotomous response may be written as:

$$f(x) = \begin{cases} ap + (1 - a)p^m & \text{if } x = m \\ aq + (1 - a)q^m & \text{if } x = 0 \\ (1 - a)\binom{m}{x}p^x q^{m-x} & \text{otherwise.} \end{cases}$$

It is straightforward to see that if a is 0, the responses are independent,
and if a is large there is high correlation. In fact, if $\hat{p}_i = x_i/n$ is the usual
estimator of p_i, then var($\hat{p}_i$) = C p_i(1 − p_i), where C = 1 + (m − 1)a. The
correction factor C is identical to the sample size correction factor derived
for clustered continuous data with a = ρ. (See Snedecor and Cochran, 1980,
p. 240.)
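
The dichotomous Cohen-Altham density is easy to evaluate, and doing so gives a direct check that the number of positive responses in a cluster has variance equal to the binomial variance mpq multiplied by C = 1 + (m − 1)a. The function names and parameter values below are chosen for illustration only.

```python
from math import comb

def cohen_altham_pmf(x, m, p, a):
    """Probability of x positive responses in a cluster of size m."""
    q = 1.0 - p
    if x == m:
        return a * p + (1 - a) * p ** m
    if x == 0:
        return a * q + (1 - a) * q ** m
    return (1 - a) * comb(m, x) * p ** x * q ** (m - x)

def mean_and_variance(pmf_values):
    """Mean and variance of a distribution on 0, 1, ..., len(pmf_values) - 1."""
    mean = sum(x * f for x, f in enumerate(pmf_values))
    var = sum((x - mean) ** 2 * f for x, f in enumerate(pmf_values))
    return mean, var

m, p, a = 5, 0.5, 0.25
pmf = [cohen_altham_pmf(x, m, p, a) for x in range(m + 1)]
mean, var = mean_and_variance(pmf)
C = 1 + (m - 1) * a
print([round(f, 4) for f in pmf])
print(round(var, 4), round(m * p * (1 - p) * C, 4))
```

The two printed variances agree, so the mixture of an all-or-nothing component and a binomial component inflates the binomial variance by exactly the factor C.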

Brier (1980) used the Dirichlet multinomial distribution (Mosimann,
1962) to model clustered categorical data. This distribution is based on the
assumption that the probability parameters vary between clusters according
to a multivariate beta distribution. In the case of a dichotomous response,
the Dirichlet multinomial distribution becomes the beta-binomial distribution
studied by Skellam (1948), Griffiths (1973), Plackett and Paul (1978)
and Paul and Plackett (1978). The probability of x positive responses in a
cluster of size m is

$$f(x; p, k, m) = \binom{m}{x}\frac{\Gamma(k)\,\Gamma(x + kp)\,\Gamma(m - x + kq)}{\Gamma(m + k)\,\Gamma(kp)\,\Gamma(kq)},$$

where 0 < k < ∞. A large k indicates low clustering. The variance of the
maximum likelihood estimator of p, $\hat{p} = x/m$, is Cpq, where C = (m + k)/(1 + k).
It may be shown that the intracluster correlation coefficient
(Cochran, 1980, p. 241) is

$$\rho = \frac{1}{1 + k}; \qquad (6)$$

thus, C = 1 + (m − 1)ρ.

In addition to the useful and satisfying qualities of (6), the beta-binomial 
model holds two advantages over the Cohen-Altham model. First, it is 
founded on a more heuristically reasonable model: clustering in categorical 
data is, by definition, the tendency of all members of a cluster to act alike, 
and this effect is fully accounted for by the variation of probabilities between 
clusters. Second, the parameter a of the Cohen-Altham model possesses an 
intuitive appeal only when it is 0 or 1. To see this, consider Figure 1, which 
consists of graphs of the two dichotomous density functions for p = .5,
a = ρ = .25 and cluster size m = 5. The dotted line, connecting density
values for the Cohen-Altham model, is in fact a weighted combination of the
Bernoulli distribution (shown by the high probabilities for 0 or 5 positive
responses) and the binomial distribution (shown by the central peak). The
beta-binomial distribution (solid line) displays a shape that is more consistent with
intuition: one expects that clustering would force the high probabilities away 
from the central values of 2 and 3, leaving a dip, rather than a peak, in the 
centre. In the following, therefore, we use the beta-binomial model as a 
description of the clustering phenomenon. 
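
Since the beta-binomial model is used from here on, a small numerical sketch of its density may be helpful. It assumes the parameterization in which the beta parameters are kp and kq, so that the intracluster correlation is 1/(1 + k); the function name and the parameter values are illustrative.

```python
from math import comb, exp, lgamma

def beta_binomial_pmf(x, m, p, k):
    """Probability of x positive responses in a cluster of size m.

    Uses log-gamma functions for numerical stability; the beta mixing
    distribution has parameters k*p and k*(1 - p).
    """
    q = 1.0 - p
    log_ratio = (lgamma(k) + lgamma(x + k * p) + lgamma(m - x + k * q)
                 - lgamma(m + k) - lgamma(k * p) - lgamma(k * q))
    return comb(m, x) * exp(log_ratio)

m, p, k = 5, 0.5, 3.0
pmf = [beta_binomial_pmf(x, m, p, k) for x in range(m + 1)]
mean = sum(x * f for x, f in enumerate(pmf))
var = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))

rho = 1 / (1 + k)          # intracluster correlation
C = (m + k) / (1 + k)      # variance inflation factor
print(round(sum(pmf), 6), round(var, 4), round(m * p * (1 - p) * C, 4))
```

The computed variance of the cluster total equals the binomial variance mpq multiplied by C, and C can equivalently be written as 1 + (m − 1)ρ.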





Figure 1. Comparison of the Cohen-Altham (dotted line) and beta-binomial 
(solid line) models for clustered data in a cluster of size 5 with p = .5 and 


3. MAXIMUM LIKELIHOOD THEORY 

Consider a set of K 2 x 2 contingency tables in which N_jt clusters of
size m are assigned to the jth row of the tth table. Using the beta-binomial
density to describe the response, x_jti, of the ith cluster in this row and
expression (1) to write p_2t in terms of p_1t and ψ, we may write the likelihood
function as:

$$L(p_{11}, p_{12}, \ldots, p_{1K}, \psi, k;\; x_{111}, \ldots, x_{2KN}) = \prod_{t=1}^{K}\prod_{j=1}^{2}\prod_{i=1}^{N_{jt}} \binom{m}{x_{jti}} \frac{G(x_{jti}, kp_{jt})\,G(m - x_{jti}, kq_{jt})}{G(m, k)},$$

where G(n, z) is the gamma function ratio

$$G(n, z) = \frac{\Gamma(n + z)}{\Gamma(z)} = \begin{cases} (n - 1 + z)(n - 2 + z)\cdots(z) & \text{if } n > 0 \\ 1 & \text{if } n = 0 \end{cases}$$

for integer n and real z. Maximization of log L requires iterative techniques.
Details of the derivation and inversion of the information matrix were given 
by Donald (1984). 
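
A direct evaluation of the log-likelihood can be sketched as follows, assuming the parameters are the first-row probabilities, the common odds ratio and k, with the second-row probabilities obtained from the common odds ratio relation. The data layout (one pair of lists of cluster counts per table) and all names are assumptions of this example; in practice the function would be passed to an iterative optimizer.

```python
from math import comb, exp, log

def log_G(n, z):
    """log of Gamma(n + z) / Gamma(z) for integer n >= 0, as a finite sum."""
    return sum(log(i + z) for i in range(n))

def p2_from_odds_ratio(p1, psi):
    """Solve psi = p1*q2 / (p2*q1) for p2."""
    return p1 / (p1 + psi * (1 - p1))

def log_likelihood(p1, psi, k, m, data):
    """Beta-binomial log-likelihood for K tables.

    p1[t] is the first-row probability of table t; data[t] is a pair
    (row-1 counts, row-2 counts), each a list of positives per cluster.
    """
    total = 0.0
    for t, rows in enumerate(data):
        probs = (p1[t], p2_from_odds_ratio(p1[t], psi))
        for p, counts in zip(probs, rows):
            for x in counts:
                total += (log(comb(m, x)) + log_G(x, k * p)
                          + log_G(m - x, k * (1 - p)) - log_G(m, k))
    return total

# Hypothetical data: two tables, four clusters of size 4 in each row.
data = [([0, 1, 4, 2], [3, 4, 4, 1]), ([1, 0, 0, 2], [2, 3, 1, 4])]
ll = log_likelihood([0.3, 0.4], 2.0, 3.0, 4, data)
print(round(ll, 4), exp(ll) > 0)
```

Writing the gamma ratio as a finite sum of logarithms avoids evaluating gamma functions at all, which is convenient when cluster sizes are small.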

The variance of the maximum likelihood estimator of the common odds 
ratio is 


$$V_{ML} = \frac{\psi^2 R}{k^2 RS - T^2}, \qquad (7)$$


where R , S and T are as in Table 3. 

Table 4 contains sample variances of the MLE for various sets of
parameters. The clusters are assumed to be evenly distributed among the tables.
As with the usual (non-clustered) MLE, the variance increases as the
common odds ratio increases. Increasing the intracluster correlation serves to
effectively decrease the sample size and thus inflate the variance; a similar
effect occurs when the cluster size is increased (without increasing the total
sample size). The limiting expression of (7) as k → ∞ (ρ → 0) is identical to
Gart’s expression (5) for the variance when there is no clustering. A similar
situation occurs when m = 1. On the other hand, as k → 0 (ρ → 1), each
cluster tends to act as a single individual, which reduces the effective sample
size to 1/m of its nominal value.

These phenomena may be seen in Figure 2, which contains graphs of the
ratio of the variance of the usual MLE (5) divided by the variance of the
clustered MLE (7). For values of the common odds ratio close to 1, this ratio
is approximately equal to 1/(1 + (m − 1)ρ). It can be shown, using methods
developed by Brillinger (1975), that under beta-binomial cluster sampling
the usual (non-clustered) maximum likelihood estimator is consistent. Thus,
the usual maximum likelihood techniques, with variances inflated by C_m =
1 + (m − 1)ρ, are dependable under cluster sampling if the odds ratio is not

Table 3. Expressions contained in formula (7) for
the variance of the maximum likelihood estimator of the
common odds ratio under cluster sampling.


R = —NH"(m, k) 


Y j(Cft + C u ) - 


{PltVltBlt + P2tg2t^2t) 2 
(Pit9tt) 2A it + (P2t?2t) 2 -^2t 


5 = E 


/_L 


T' \ (pitqftAit + (p2tg 2 t A 


ii, A * } 




(pit9itP2t92t(Pu9it^it-B 2 t - Pit<tetMtBit) 

(pit9it) 2A it + (p2t92t) 2A 2t 


_ G(a;, kp jt )G{m-x, kg }t ) 1 

it_ G(m,fc) (* + *?*) 


+ 


G(x, kqj t )G(m - s, fepyt) 
G(m, A:) 


a- 1 j 

§(M-%t) 2 


r, _ , T •ST'f'A PjtGfo, kp jt )G(m - X, kq jt ) y4 1 
Bit- N ^\ x ) G (m,k) ^{t + kpi t) 


k X —1 


g jt G(x, kq jt )G{m - a, fcp jf ) y> a 

^(* + %t) 2 


m\ Pj t G(x, kpj t )G{m - x, kg^) 1 

x) G[m,k) + kp,t) 2 

gj t G(x, kq jt )G(m - x, kp jt ) 1 

G(m,k) £^(* + %t) 2


Table 4. Sample variances of the maximum likelihood estimator of the
common odds ratio under cluster sampling. Probability bands consist of the
probability parameters associated with the first row of each table. The sample
size in each case is 210.

1. Number of tables = 5, Common odds ratio = 1
   Probability band = (.05, .10, .15, .20, .25)

                      CLUSTER SIZE
  ρ        2        4        6        8       12

 .05     .9556    .8779    .8120    .7555    .6634
 .20     .8540    .6660    .5489    .4684    .3643
 .40     .7474    .5089    .3914    .3204    .2376
 .60     .6562    .4029    .2962    .2362    .1702
 .80     .5745    .3201    .2252    .1750    .1222
 .95     .5180    .2666    .1805    .1368    .0925


2. Number of tables = 5, Common odds ratio = 1
   Probability band = (.10, .30, .50, .70, .90)

                      CLUSTER SIZE
  ρ        2        4        6        8       12

 .05     .9539    .8735    .8058    .7478    .6539
 .20     .8435    .6457    .5250    .4434    .3394
 .40     .7307    .4839    .3657    .2956    .2153
 .60     .6405    .3836    .2779    .2195    .1560
 .80     .5650    .3100    .2164    .1672    .1159
 .95     .5154    .2642    .1784    .1350    .0911


3. Number of tables = 10, Common odds ratio = 5
   Probability band = (.04, .07, .10, .13, .19, .22, .25, .28, .31)

                      CLUSTER SIZE
  ρ        2        3        4        6       12

 .05     .9609    .9235    .8890    .8275    .6867
 .20     .8712    .7739    .6982    .5871    .4051
 .40     .7693    .6326    .5411    .4247    .2668
 .60     .6752    .5191    .4258    .3177    .1870
 .80     .5861    .4214    .3318    .2355    .1295
 .95     .5218    .3550    .2698    .1831    .0942


4. Number of tables = 10, Common odds ratio = 5
   Probability band = (.05, .15, .25, .35, .45, .55, .65, .75, .85, .95)

                      CLUSTER SIZE
  ρ        2        3        4        6       12

 .05     .9591    .9174    .8792    .8117    .6604
 .20     .8503    .7386    .6539    .5335    .3471
 .40     .7383    .5887    .4920    .3732    .2211
 .60     .6473    .4843    .3898    .2833    .1598
 .80     .5700    .4031    .3139    .2195    .1179
 .95     .5185    .3511    .2660    .1798    .0918



Figure 2. Graphs of the variance ratio $V_0/V_{ML}$ for 10 tables; probability
band (.05, .15, .25, .35, .55, .65, .75, .85, .95); sample size 240; odds ratios
5 (upper graph), 1 (lower graph). Vertical axis: relative efficiency;
horizontal axis: intracluster correlation.

large. For larger values of ψ, the graphs tend to flatten, as can be seen for
ψ = 5 in Figure 2.

If the cluster sizes are mixed, let $a_m$ be the proportion of clusters of size
$m$. The results of Bradley and Gart (1962) on associated populations may
then be applied to obtain


V ML — ^ ^ 

where $V_m$ is the variance of the MLE for cluster size $m$.


4. MANTEL-HAENSZEL PROCEDURES 

4.1 Variance of the Mantel-Haenszel Estimator 

The Hauck variance (4) of $\hat\psi_{MH}$ may be adjusted to compensate for
cluster sampling. Let $N_{jtm}$ be the number of clusters of size $m$ in row $j$ of
table $t$ and $a_{jt}$ be the proportion of clusters in row $j$ of table $t$. Let


c jt — 


5^C7 m iVy 

m _ 

jtm 

m 


and 


Then 




Clt 


+ 


C 2 t 


a ltPlt<Ilt a 2tP2t<l2t 


var(t Pmh) = 


» (£«?)*’ 

t 


where $w_t$ is as in Hauck's formula (4).


4.2 Adjustment to the Mantel-Haenszel Test 

The Mantel-Haenszel test statistic (3) may be adjusted using the
weighted adjustment factors $c_{jt}$ defined above. Under beta-binomial
sampling, the adjusted Mantel-Haenszel chi-square statistic, $\chi^2_{MHC}$, is
asymptotically chi-square with one degree of freedom:


v2 _ 

a mh — 



1 x lt y 2 t ~ x 2 tyit 1 

ttltClt + fl2tC2t 


2 


/£ 


nunuXtyt 
nt(n lt cu + n 2t c 2t - 


1 )' 
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5. SIMULATION STUDY 

Since the results above are valid asymptotically only, we designed a
simulation study that, in part, examined the performance of the Mantel-Haenszel
procedures in a sample of size 240 under clustering modelled by the
beta-binomial distribution. Full details of this study, which also assessed
maximum likelihood techniques and methods of estimating intraclass correlation,
were given by Donald (1985).

The parameters used for the simulation were:

Number of tables: 5
Probability bands:
    Narrow band: (.05, .10, .15, .20, .25)
    Wide band: (.10, .30, .50, .70, .90)
Cluster sizes: 2, 4, 6
Odds ratios: 1, 3, 5
Correlation coefficients: 0, .05, .10, .20, .50, .80

For each set of parameters, 300 iterations were performed.
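
The design just listed can be illustrated with a minimal Python sketch of one parameter combination. It is not the authors' program: the allocation of 24 individuals to each row of each table (5 tables x 2 rows x 24 = 240), the use of the band probability for the second row, and all function names are assumptions made here for illustration. The beta-binomial is parameterized so that its intracluster correlation is $\rho$, and the estimator is the usual Mantel-Haenszel ratio $\sum a_t d_t / n_t \,\big/ \sum b_t c_t / n_t$.

```python
import random

def design_effect(m, rho):
    """C_m = 1 + (m - 1) * rho for clusters of size m."""
    return 1 + (m - 1) * rho

def exposed_prob(p0, psi):
    """Row-1 success probability implied by row-2 probability p0 and odds ratio psi."""
    return psi * p0 / (1 - p0 + psi * p0)

def cluster_successes(rng, m, p, rho):
    """Successes in one cluster of size m, beta-binomial with correlation rho."""
    if rho <= 0:
        q = p                                    # no clustering: plain binomial
    elif rho >= 1:
        q = 1.0 if rng.random() < p else 0.0     # all members respond alike
    else:
        scale = (1 - rho) / rho                  # a + b of the beta mixing distribution
        q = rng.betavariate(p * scale, (1 - p) * scale)
    return sum(rng.random() < q for _ in range(m))

def simulate_tables(rng, band, psi, m, rho, per_row=24):
    """One replicate: a list of 2x2 tables (a, b, c, d), one per band probability.
    per_row = 24 is an assumed allocation giving 240 individuals in all."""
    tables = []
    for p0 in band:
        rows = []
        for p in (exposed_prob(p0, psi), p0):
            x = sum(cluster_successes(rng, m, p, rho) for _ in range(per_row // m))
            rows.append((x, per_row - x))
        (a, b), (c, d) = rows
        tables.append((a, b, c, d))
    return tables

def mantel_haenszel(tables):
    """Mantel-Haenszel odds ratio estimate; None when the denominator is zero."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den if den > 0 else None

rng = random.Random(1985)                        # arbitrary seed for repeatability
band = (.05, .10, .15, .20, .25)                 # narrow band
estimates = [mantel_haenszel(simulate_tables(rng, band, psi=3, m=4, rho=.10))
             for _ in range(300)]
estimates = [e for e in estimates if e is not None]
print(sum(estimates) / len(estimates) - 3)       # observed bias for this configuration
```

Looping the last few lines over both bands and every cluster size, odds ratio and correlation gives a grid of the same shape as Tables 5 to 8, though the numbers will not reproduce the published ones.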

Compared with the MLE, the Mantel-Haenszel estimator of the odds
ratio performed well in terms of variance and bias for small values of the
intracluster correlation coefficient and for small cluster size. As Table 5
indicates, increasing the value of $C_m = 1 + (m - 1)\rho$ diminishes the efficiency
of the Mantel-Haenszel estimator. As expected, increasing the value of the
common odds ratio inflated the variance of both the Mantel-Haenszel and
the maximum likelihood estimators. Both estimators displayed consistently
positive bias for all parameter combinations (Tables 6 and 7). This bias
tended to rise for larger values of $\psi$ and $\rho$, but not with respect to the
cluster size.

Table 8 contains the observed rejection rates of the Mantel-Haenszel test
of $H_0: \psi = 1$ at $\alpha = .05$. For values of the intracluster correlation coefficient
less than .20 and cluster size 4 or fewer, the test performed acceptably; larger
values of $\rho$ and $m$ led to high rejection rates.

In conclusion, the Mantel-Haenszel procedures perform well if the
intracluster correlation coefficient is expected to be less than .1 and the clusters
are smaller than 4. In other circumstances, the Mantel-Haenszel estimate of the
common odds ratio has positive bias and a large variance, and the
Mantel-Haenszel test of significance is likely to yield a spuriously significant result.
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Table 5. Observed variance relative efficiencies of odds ratio estimators. 
The ratio is the variance of the maximum likelihood estimator divided by the 
variance of the Mantel-Haenszel estimator. Hence, an entry greater than 1 
indicates that the Mantel-Haenszel estimator has lower variance. 

Probability band = (.05, .10, .15, .20, .25)

                      INTRACLUSTER CORRELATION
  ψ    m      .00      .05      .10      .20      .50      .80
  1    2     1.044    1.032    1.053    1.029     .985    1.029
       4     1.039    1.083    1.063    1.039     .765     .883
       6     1.063    1.089    1.065    1.097     .981     .508
  3    2      .921     .985    1.092     .903    1.039     .901
       4     1.057     .906     .877     .971     .588     .439
       6     1.019    1.013     .724     .944     .609     .470
  5    2     1.023     .883    1.021     .790    1.113     .705
       4      .985    1.183     .909    1.035     .798     .171
       6     1.013    1.244     .875     .759     .331     .796

Probability band = (.10, .30, .50, .70, .90)

                      INTRACLUSTER CORRELATION
  ψ    m      .00      .05      .10      .20      .50      .80
  1    2     1.034    1.035    1.062    1.095    1.074    1.094
       4     1.044    1.034    1.064    1.040     .993     .953
       6     1.033    1.076    1.068    1.083    1.072    1.016
  3    2     1.036    1.086    1.087    1.176    1.099    1.173
       4     1.065    1.035    1.080    1.067    1.191    1.174
       6     1.077    1.060    1.029    1.206     .554     .152
  5    2     1.014    1.069    1.020    1.033    1.086     .900
       4     1.070    1.029    1.091    1.375     .998     .464
       6     1.075    1.073    1.154     .902     .257     .197
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Table 6. Observed biases of the Mantel-Haenszel estimator of the common 
odds ratio under cluster sampling. 


Probability band = (.05, .10, .15, .20, .25)

                      INTRACLUSTER CORRELATION
  ψ    m      .00      .05      .10      .20      .50      .80
  1    2      .070     .075     .028     .044     .033     .092
       4      .095     .102     .108     .136     .209     .185
       6      .075     .103     .091     .093     .196     .501
  3    2      .536     .780     .516     .553     .600     .498
       4      .790    1.140    1.055    1.168    2.016    1.158
       6      .554     .866     .978    1.064    1.237     .911
  5    2     1.339    2.036    1.322    1.654    1.209    1.251
       4     1.804    1.949    2.255    2.599    2.773    4.640
       6     1.618    1.662    2.083    2.086    2.589     .091



Probability band = (.10, .30, .50, .70, .90)

                      INTRACLUSTER CORRELATION
  ψ    m      .00      .05      .10      .20      .50      .80
  1    2      .029     .051     .078     .033     .073     .048
       4      .053     .071     .072     .098     .104     .096
       6      .020     .017     .037     .081     .051     .182
  3    2      .136     .122     .132     .113     .301     .338
       4      .187     .277     .375     .488     .505     .706
       6      .162     .188     .271     .477    1.030    1.334
  5    2      .272     .295     .531     .221     .727     .899
       4      .336     .397     .430     .642    1.208    2.731
       6      .275     .535     .595    1.094    3.223    3.733
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Table 7. Observed biases of the Gart (non-clustered unconditional maximum 
likelihood) estimator of the common odds ratio under cluster sampling. 


Probability band = (.05, .10, .15, .20, .25)

                      INTRACLUSTER CORRELATION
  ψ    m      .00      .05      .10      .20      .50      .80
  1    2      .072     .077     .026     .043     .040     .100
       4      .097     .107     .112     .143     .226     .206
       6      .077     .108     .096     .098     .212     .541
  3    2      .591     .834     .582     .629     .738     .583
       4      .888    1.140    1.168    1.354    2.228    1.520
       6      .626     .989    1.079    1.230    1.471    1.304
  5    2     1.474    2.109    1.454    1.810    1.614    1.384
       4     1.804    1.949    2.255    2.599    2.773    4.640
       6     1.816    1.934    2.282    2.416    3.047     .702



Probability band = (.10, .30, .50, .70, .90)

                      INTRACLUSTER CORRELATION
  ψ    m      .00      .05      .10      .20      .50      .80
  1    2      .030     .053     .081     .035     .077     .053
       4      .055     .073     .076     .103     .115     .107
       6      .021     .018     .040     .087     .062     .228
  3    2      .195     .189     .211     .193     .405     .478
       4      .245     .347     .460     .592     .727     .998
       6      .216     .252     .346     .657    1.337    1.790
  5    2      .406     .469     .699     .406     .967    1.242
       4      .505     .562     .645     .949    1.713    3.286
       6      .435     .719     .849    1.466    3.565    4.311
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Table 8. Observed rejection rates for the Mantel-Haenszel test of Ho : tp = 1 
under the null hypothesis . Asterisks indicate the significance of the difference 
between the observed rate and .05 (*, p < .05; **, p < .01; ***, p < .001). 


1. Probability band = (.05, .10, .15, .20, .25)

                      INTRACLUSTER CORRELATION
  m      .00       .05       .10       .20       .50       .80
  2     .048      .019*     .030      .049      .082**    .124***
  4     .056      .072      .069      .100***   .200***   .267***
  6     .055      .093**    .106***   .116***   .223***   .306***


2. Probability band = (.10, .30, .50, .70, .90)

                      INTRACLUSTER CORRELATION
  m      .00       .05       .10       .20       .50       .80
  2     .034      .041      .037      .085**    .090**    .069
  4     .050      .057      .096***   .110***   .140***   .179***
  6     .027      .074      .131***   .181***   .191***   .304***
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GOODNESS-OF-FIT ISSUES IN TOXICOLOGICAL 
EXPERIMENTS INVOLVING LITTERS 
OF VARYING SIZE 

1. INTRODUCTION 

Data on fetal deaths in litters of varying sizes were used by Haseman and 
Soares (1976) and re-used by Kupper and Haseman (1978) to exemplify the 
fitting of certain models for explaining the variation between litters in the 
occurrence of such deaths. Model fitting plays an important role, for then 
any treatment effects, litter-size effects, or other possible causes of variation 
can be studied by determining whether estimated model parameters are 
significantly affected by such causes of variation. 

In the work of Haseman and Soares (1976), the principal models fitted,
but found wanting, were a Poisson model and a binomial model. Clear
improvements in fit were shown by Kupper and Haseman (1978) by either
a beta-binomial (BB) model or a correlated-binomial (CB) model. Since
neither of these two models nested into the other and differences in fit were
not extreme, no formal test of which was the more appropriate model was
feasible.

Actually, both of these models take intra-litter correlation into account.
By the beta-binomial model, a form of compound binomial, a positive
intra-litter correlation would result because of litter members sharing a common
binomial parameter. With the correlated binomial model, the intra-litter
correlation could be either positive, because of the community of the
prenatal exposures of litter members, or negative, because of competition
between litter members.

Paul (1985) derived a model for combining the features of both of these
models into a beta-correlated-binomial (BCB) model. A particular situation
for which applicability was seen was in the distribution of a grades
among students, barring detailed data on separate examination results. A
BCB model could take into account simultaneously the variation in difficulty
among examinations and the variation in ability among students.

1 Department of Mathematics, Statistics and Computer Science, The American
University, 4900 Auburn Avenue, Bethesda, Maryland 20814

2 Department of Mathematics and Statistics, University of Windsor, Windsor,
Ontario N9B 3P4

I. B. MacNeill and G. J. Umphrey (eds.), Biostatistics, 169-176.
© 1987 by D. Reidel Publishing Company.

Since a hierarchical relationship existed between the BCB model and
either the BB or the CB model, the significance of any apparent improvement
in fit could be evaluated. One could determine whether the BB aspect of
the BCB model was needed for a good fit, or if the CB aspect of the BCB model
was needed, or if both aspects were needed.

When all three models were fitted and goodness-of-fit chi squares
calculated, a puzzling result was found. For BB and CB, those chi squares
were the same as Kupper and Haseman had reported, verifying that their
calculating procedures had been followed. But the BCB chi square in some
very few instances exceeded either the BB or the CB chi square.

This anomaly we found, on consideration, to be related to the
non-constancy of litter size in the data. Further, such a variety of improprieties
was found to attend the calculation of goodness-of-fit chi square as to render
that statistic unsuitable for its apparent purpose in the circumstance of
varying litter size. While difficulties can be avoided through use of the
likelihood chi square, some difficulties remain.

In what follows, we will describe the problems we found to be
associated with the goodness-of-fit chi squares. Though our investigation had
involved an abundance of calculations, we will not be giving the results of
those calculations here. The case is that once each difficulty is identified,
the unsuitability of the goodness-of-fit chi square becomes more and more
manifest.

Aside from questions of the suitability of the goodness-of-fit chi square, 
we plan to discuss some other aspects of toxicological experiments relating 
to litter size. 


2. SOME BACKGROUND 

Essentially the data considered by Haseman and Soares (1976) and
Kupper and Haseman (1978) consist of paired observations of the form $(r_i, n_i)$,
denoting that litter $i$ is of size $n_i$ and contains $r_i$ affected individuals. In
Haseman and Soares (1976), data are presented in the form $f(r,n)$, the
frequency of litters in which $r_i = r$, $n_i = n$.

In both papers, in those tables where observed and fitted frequencies are
shown and compared via chi square, what is given is $f(r)$, the number of
litters in which $r_i = r$ irrespective of $n_i$, i.e. $\sum_{n \ge 1} f(r,n)$, summation being
up through the largest litter size in the data. Chi square is then calculated
as $\sum_r [f(r) - E(f(r))]^2 / E(f(r))$. In the examples given, $r$ values in excess
of 3 have been lumped. Note that the $E(f(r))$ are obtained by maximum
likelihood fitting.

For future reference we define $s_i$ as $n_i - r_i$, so that in parallel with $f(r,n)$
and $f(r)$ we could have $g(s,n)$ and $g(s)$.

3. PARTICULAR DIFFICULTIES 

A. Non-correspondence of the data being fitted and the data used in
assessing goodness-of-fit

The actual data being fitted are the assemblage of $(r_i, n_i)$ paired
observations or, equivalently, the set of $f(r,n)$ frequencies. But the set of
frequencies used in calculating chi square consists of the $f(r)$ values, the $n$
values being ignored. Conceivably, the $f(r)$ values could be well fitted, but
not the $f(r,n)$ values. For the goodness-of-fit chi square to show proper
behavior, our maximum likelihood fitting should have been to the set of $f(r)$
values, given the assemblage of $n_i$ values. Of course, it would have been
foolish and inelegant to fit the $f(r)$ data when we had the $f(r,n)$ data.

But, note that were all the $n_i$ equal, the impropriety of considering $f(r)$
rather than $f(r,n)$ would not occur.

B. Unsuitability of use of a $\sum (O - E)^2/E$-type formula for calculating chi
square ($O$ = observed, $E$ = expected, e.g. $f(r)$ and $E(f(r))$)

In many instances, consideration of $(O - E)^2/E$ properly takes into
account the covariances amongst the $O$ values, but not generally. The general
approach is to get the variance-covariance matrix of the $O_j$ values, then,
after deleting the last row and column (or several last rows and columns
where data have been lumped), inverting so as to get a matrix with elements
$c_{jk}$. Chi square is then calculated as $\sum c_{jk} d_j d_k$, where $d_j = O_j - E_j$,
$d_k = O_k - E_k$. (In a birth order problem involving sibships of varying sizes, Mantel
and Halperin (1963) demonstrated a proper calculation of chi square, but
took advantage of a particular circumstance so as to avoid matrix inversion.)

Computation of the variance-covariance matrix to be inverted is
straightforward. We will consider the individual litter with outcome $(r_i, n_i)$, though
adaptation for $f(r,n)$ values is simple.

For litter $i$, suppose that the fitted probability that $r_i = j$ is $P_i(j)$, i.e.
$P_i(0), P_i(1), \ldots, P_i(n_i)$, with $P_i(j) = 0$ for $j > n_i$. These $P_i(j)$ values are the
contributions of litter $i$ to the $E_j$ values, while the contribution to the $O_j$
values is 1 for $j = r_i$, 0 for all other $j$. The variance of the contribution to
any $O_j$ value is $P_i(j)(1 - P_i(j))$, while the covariance of the contribution to
$O_j$ and to $O_k$ is $-P_i(j)P_i(k)$. The overall variance-covariance matrix of the
$O_j$ is simply the sum of the individual litter contributions to the matrix.
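
The calculation described in this subsection can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the fitted $P_i(j)$ come from a plain binomial model chosen only for convenience, outcomes $j \ge k$ are collapsed into a single final cell before the last row and column are deleted (one reading of the lumping step), and every function name is hypothetical. The linear system is solved directly rather than forming the inverse, which gives the same quadratic form $\sum c_{jk} d_j d_k$.

```python
from math import comb

def binomial_pmf(n, p):
    """Fitted P_i(j), j = 0..n, under a plain binomial model (illustrative choice)."""
    return [comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)]

def lump(probs, k):
    """Collapse outcomes j >= k into one final cell, giving k + 1 cells."""
    head = list(probs[:k]) + [0.0] * max(0, k - len(probs))
    return head + [max(0.0, 1.0 - sum(head))]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def covariance_chi_square(litters, fitted, k=3):
    """litters: list of (r_i, n_i); fitted: list of P_i(.) vectors, one per litter.
    Returns (chi square, observed cell counts, expected cell counts)."""
    cells = k + 1
    obs = [0.0] * cells
    exp = [0.0] * cells
    cov = [[0.0] * cells for _ in range(cells)]
    for (r, _n), probs in zip(litters, fitted):
        p = lump(probs, k)
        obs[min(r, k)] += 1                      # contribution of litter i to O_j
        for j in range(cells):
            exp[j] += p[j]                       # contribution of litter i to E_j
            for h in range(cells):
                cov[j][h] += p[j] * (1 - p[j]) if j == h else -p[j] * p[h]
    d = [obs[j] - exp[j] for j in range(cells - 1)]      # last cell deleted
    v = [row[:cells - 1] for row in cov[:cells - 1]]
    x = solve(v, d)                              # x = V^{-1} d
    return sum(dj * xj for dj, xj in zip(d, x)), obs, exp

# Hypothetical litters of varying size, all fitted with p = 0.2.
litters = [(0, 4), (1, 10), (2, 7), (0, 2), (3, 12), (1, 5), (4, 9)]
fitted = [binomial_pmf(n, 0.2) for _, n in litters]
chi2, obs, exp = covariance_chi_square(litters, fitted)
print(chi2, sum((o - e) ** 2 / e for o, e in zip(obs, exp)))
```

When every litter has the same size, and hence the same fitted probabilities, the two printed quantities coincide; with varying sizes they generally differ, which is the point of this subsection. The sketch will fail with a division by zero if a retained cell has zero fitted probability in every litter, since the matrix is then singular.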

C. The need to modify chi square when comparing alternative fittings

Essentially, when we have computed our chi square properly, it tells us
how much worse our fitted values match the data than would the results of
a perfectly fitting model. But, how should we judge the comparison of two
imperfectly-fitting models?

The chi squares for two fits could vary because of differences in their
$(d_j)$ vectors or differences in their $(c_{jk})$ inverse matrices. To make proper
comparison, the $(c_{jk})$ inverse matrix must be kept constant between any two
fits. Mantel and Patwary (1961) have described the need to keep constant
the weighting function in assessing alternative fittings. An illustration was
given by Mantel and Brown (1973) of data fitted by a variety of models. For
any two models which nested, the simpler model (the one requiring the
fewer fitted parameters) would be considered the null model to provide the
common weighting function, i.e. the $c_{jk}$ values or, when appropriate, the
$1/E_j$ values. The difference in the chi squares is then influenced only by the
difference in the $(d_j)$ vectors.

D. Do we count the responders (e.g. defectives, deaths, etc.) or the
non-responders (e.g. normal fetuses, survivors, etc.)

Suppose we had turned our data around so as to focus on the normal
fetuses, i.e. the $(s_i, n_i)$ data and the $g(s,n)$ frequencies. Instead of our
considering a litter of size 10 to have 2 defectives, we would consider it to
have 8 non-defectives. Gross et al. (1970) in studying rat reproduction
emphasized the number in the litter surviving through weaning, ignoring
the losses at earlier stages.

Considering the $s_i$ rather than the $r_i$ would make no difference were
the $n_i$ all equal. And even with the $n_i$ unequal, there is really no change
in the fitted parameter values or in the true fit to the data. Where the
change comes about is in the calculation of chi square relative to the $g(s)$
frequencies. The chi square relative to the $g(s)$ frequencies may not accord
with the chi square relative to the $f(r)$ frequencies, let alone with the $g(s,n)$
or equivalent $f(r,n)$ frequencies. Compensating errors could keep either or
both of these chi squares small even if the more detailed data are fit poorly.

Summarizing A - D 

With varying litter size, the data used in calculating the goodness-of-fit
chi square are not the same as the data actually fitted. Apart from that,
the chi-square computation formula becomes inappropriate and, instead, a
matrix inversion approach would be needed. Litter size differences can also
result in different assessments of goodness-of-fit according to whether one
looks at the marginal distributions of defectives or deaths in a litter or at
the marginal distribution of normals or survivors in a litter. Apart from
litter size variation, if alternative models are being compared via
goodness-of-fit chi squares, then the same weighting function must be used for both
fittings.


4. THE LIKELIHOOD CHI SQUARE 

The difficulties associated with use of the goodness-of-fit chi square under 
circumstances of varying litter size largely disappear when it is the likelihood 
value which provides the test statistic, i.e. the likelihood chi square. But, 
an important caution remains. 

First off, the calculated likelihood for the fitted model relates to the
initial $(r_i, n_i)$ paired values, or the equivalent $f(r,n)$ frequencies. The clumping
into $f(r)$ frequencies plays no role in calculating the likelihood. Next, it is
immaterial to the fitting, and to the calculated likelihood, whether we
consider our observations to consist of $(r_i, n_i)$ pairs and $f(r,n)$ frequencies, or
if we considered instead the $(s_i, n_i)$ pairs and the $g(s,n)$ frequencies. But
caution is necessary because our data could very readily be insufficient for
proper asymptotic behavior of our likelihood chi square; some other way of
evaluating the likelihood could be necessary.
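
The claim that the likelihood is unaffected by counting responders or non-responders can be checked numerically. The short Python sketch below uses a plain binomial model and made-up litters purely for illustration; neither is taken from the papers discussed, and the names are hypothetical.

```python
from math import comb, log

def binomial_loglik(pairs, p):
    """Log-likelihood of (count, size) pairs under a common binomial probability p."""
    return sum(log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p) for k, n in pairs)

litters = [(2, 10), (0, 7), (1, 12), (3, 9), (1, 4)]     # hypothetical (r_i, n_i) pairs
p_hat = sum(r for r, _ in litters) / sum(n for _, n in litters)

survivors = [(n - r, n) for r, n in litters]             # the (s_i, n_i) pairs
q_hat = sum(s for s, _ in survivors) / sum(n for _, n in survivors)

loglik_r = binomial_loglik(litters, p_hat)               # fitted to responders
loglik_s = binomial_loglik(survivors, q_hat)             # fitted to non-responders
print(loglik_r, loglik_s)
```

The two maximized log-likelihoods agree up to rounding, with the fitted probabilities related by `q_hat = 1 - p_hat`, whereas the goodness-of-fit chi squares based on $f(r)$ and on $g(s)$ need not agree.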

Consider the case of a single fixed litter size, say $n = 10$. There would
be 11 possible outcomes for a litter, with $r_i$ ranging from 0 to 10. Some
of those outcomes could have extremely low probabilities. Unless the total
number of litters were rather large, asymptotic behavior for the likelihood
chi square could not be expected. The goodness-of-fit chi square, acceptable
in the fixed $n$ case, might be applied by collapsing outcomes.

But with varying $n$, the possible number of outcomes gets very much
larger. In an example given by Haseman and Soares (1976), $n$ ranged from
1 to 20, so that the total number of cells would be 2 + 3 + ... + 21 = 230.
The apparent degrees of freedom would be 210 minus the number of fitted
parameters. In even a study of large size, the largest number of cells would
have trivially small fitted expectations, and actual frequencies of zero.
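
The cell count quoted above is easy to verify. In the Python sketch below, reading 210 as the 230 cells less one constraint per litter size (the probabilities for each $n$ sum to one) is an interpretation assumed here; the text gives only the figures.

```python
sizes = range(1, 21)                    # litter sizes n = 1, ..., 20
cells = sum(n + 1 for n in sizes)       # r can take the values 0, 1, ..., n
constraints = len(sizes)                # assumed: one constraint per litter size
apparent_df = cells - constraints       # before subtracting fitted parameters
print(cells, apparent_df)
```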

It would appear that the likelihood chi square would be unsuitable for
gauging goodness-of-fit, yet it should still be suitable for comparing
alternative fittings by models with fewer or more fitted parameters, with one
model a special case of the other, i.e. a nesting or hierarchical relationship
exists between the models. The many cells with zero observations would
play no role in distinguishing between the likelihoods. Likelihood chi square
for distinguishing between fits would be with limited degrees of freedom.

Illustration of how the likelihood chi square, as well as the goodness- 
of-fit chi square, can be used to distinguish between alternative models is 
provided by Mantel and Brown (1973). 


5. DEFINITIONS OF $r_i$ AND $n_i$

Simple fitting of the $(r_i, n_i)$, testing of alternative models, and even
assessing treatment effects based on such data can be an incomplete analysis.

Consider that the relationship between $r_i$ and $n_i$ can be altered by
non-specific effects of treatment on the response of interest. If treatment
increases, or decreases, the viability of defective fetuses through some early
stage of gestation, it will increase, or alternatively decrease, the relative
frequency of such defective fetuses at a later stage. More generally,
differential viability of defective and normal fetuses resulting from treatment will
influence the relative frequency of defective fetuses apart from any specific
teratogenic or anti-teratogenic effect (or other specific adverse or protective
effect) of treatment.

This difficulty can be addressed by studying the influence of treatment
on $r_i$ alone, on $s_i = n_i - r_i$ alone, and even on $n_i$ alone; i.e., given that treatment
may affect $r_i$, $s_i$ and $n_i$, separate analysis of treatment effect on each of
these indices would be helpful. Also, quite apart from treatment effects, the
data can be analyzed for effects of litter size itself on the governing model
parameters. Thus, litters can be grouped by broad litter-size categories,
with parameters fitted overall and also separately by broad category.

In their work, Haseman and Soares (1976) quite reasonably took for
$n_i$ the number of implants in a pregnancy, $r_i$ being the number of dead
implants; in this case, interest was in dominant lethal effects. But in other
situations, $n_i$ would necessarily have to be either the number of unresorbed
fetuses or live unresorbed fetuses, with $r_i$ the number of such fetuses with
some specific defect. Where pregnancies have to be carried to term, $n_i$ might
be either the number of fetuses carried to term (in which case a candidate $r_i$
could be the number of stillbirths), or $n_i$ might be the number of live births
in the litter.

A somewhat comprehensive view of this kind was taken by Gross et
al. (1970) in their study relating to rat reproduction. While their interest
centered on the numbers per litter surviving to weaning, more specifically
21 days post birth, they considered that there were various stages in the
reproduction process.

Thus, drug treatment might influence the likelihood of conception at
a mating. Next, it might influence the number of implants occurring in a
successful mating. Successive inroads on surviving pregnancy would occur
as there were embryonic deaths, stillbirths, perinatal deaths in the first four
days after birth, and finally, deaths between 4 days and weaning at 21
days. Since their interest was on rat reproduction generally, not on specific
aspects, Gross et al. (1970) focused on the number of 21-day survivors. In
a sense, they were analyzing only the $s_i$ values, without regard to the $n_i$
values. But, if the separate stages were considered, the $s_i$ value at one stage
would become the $n_i$ value for the next stage.


6. SUMMARY 

When litters vary in size, certain goodness-of-fit chi-square tests for
adequacy of statistical models, e.g. a binomial model, a beta-binomial model
or a correlated beta binomial model, can become unsuitable. Reasons
include that the data by which the goodness-of-fit is gauged are not identical
to the data actually fitted. Also, the blindly followed computation of chi
square does not properly take into account the variance-covariance
structure of the data. In any case, the goodness-of-fit chi square, even were the
variance-covariance structure taken into account, would differ according to
whether one were considering responders or non-responders. In comparing
alternative models, changes in the variance-covariance matrix should not
be allowed. While a likelihood chi-square approach could bypass most of
these difficulties, in the case of varying litter sizes, the likelihood chi square
could be unsuitable for assessing goodness-of-fit, but suitable for
distinguishing between alternative fittings. Concerns for other aspects of toxicological
experiments involving litters must be carefully considered in any proper
evaluation.
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DERIVATION OF LARGE SAMPLE EFFICIENCY OF 
MULTINOMIAL LOGISTIC REGRESSION COMPARED TO 
MULTIPLE GROUP DISCRIMINANT ANALYSIS 

ABSTRACT 

The comparison of logistic regression and normal discrimination (Efron,
1975) is extended to the case of more than two response groups. The large
sample distribution of the maximum likelihood estimate of the log odds ratio
is derived for each of the two procedures, using matrix calculus methods, and
relative estimating efficiency (RE) is then defined as the inverse of the
variance ratio. It is shown that Efron's efficiency measure, asymptotic relative
efficiency, is a special case of the measure given here. We evaluate two cases
in which RE is parameterized only by the response group frequencies and
the log odds ratio parameters. The results are presented in two-dimensional
contour diagrams.


1. INTRODUCTION 

Cornfield (1962) developed the binomial logistic risk function in the
context of a two-group discriminant function analysis; one group being
individuals with clinically manifest heart disease, and the other, those without. He
sought a parsimonious model of the joint dependence of coronary heart
disease risk on serum cholesterol and systolic blood pressure in data from the
Framingham follow-up study (Greenhouse, 1982). In discussing this model,
Cornfield noted that “our present interest is not in discrimination ... , but
... to study the quantitative nature of the dependence of risk on serum
cholesterol and systolic blood pressure levels.”

The analogous multiple group problem assumes that a vector of
explanatory variables has a multivariate normal distribution with equal covariance
matrices, but different mean vectors, in each of several populations. For
example, in Cornfield's problem, the coronary heart disease group might be
divided into two mutually exclusive groups: individuals with myocardial
infarction or individuals with angina pectoris. Indeed, this approach might
be more informative since “the combination of these two conditions into a
single category is not, of course, intended to imply that they necessarily have
the same etiology” (Cornfield, 1962). Expressing the logistic parameters as
functions of the normal parameters, application of Bayes theorem in the
discrimination model yields the multinomial logistic probability functions
(Lachenbruch, 1975).

There are two different procedures for estimating the logistic parameters:
(1) substitution of the unconditional maximum likelihood estimates of the
normal distribution parameters into the logistic parameter expressions, and
(2) maximum likelihood estimation based on the conditional probability of
being in a particular group given the explanatory variables. The former
procedure is referred to as multiple group discriminant analysis and the
latter, multinomial logistic regression. Details are given in Section 2.

Efron (1975) compared the efficiency of the two group linear
discrimination and binomial logistic regression procedures with respect to classification
error rate and found that logistic regression is between one half and two
thirds as effective as normal discrimination when the normality assumptions
are satisfied. He used asymptotic relative efficiency, a measure of relative
test efficiency, to compare the two procedures.

As Cornfield's (1962) comments indicate, it is typical in epidemiological
research to be interested in estimating the effect of explanatory factors
on the risk of an individual being in one of two or more response groups.
We therefore consider the relative efficiency of the two procedures in terms
of coefficient estimation and extend the comparison to the case of multiple
groups. The large sample distribution of the estimate of the log odds ratio is
derived for each of the procedures, using matrix calculus methods. Relative
estimating efficiency is then defined as the inverse of the asymptotic
variance ratio. It is shown that Efron's efficiency measure, asymptotic relative
efficiency, is a special case of the measure given here.

We evaluate relative estimating efficiency, focussing on its relationship 
to the value of the log odds ratio for three response groups in the case of two 
explanatory variables. Of secondary interest is the effect of response group 
frequencies on relative efficiency. 





2. ASSUMPTIONS, DEFINITIONS AND NOTATION 

A random variable $X$ can arise from one of $J + 1$ $p$-dimensional
populations such that:

$$X \sim N_p(\mu_j, \Sigma) \quad \text{with probability } \pi_j \quad (j = 0, 1, \ldots, J).$$

The normal populations thus differ in mean but not in variance-covariance
structure. The space is exhausted by the populations in that $\pi_j > 0$ for all $j$
and $\sum_{j=0}^{J} \pi_j = 1$, so $\pi_0 = 1 - \sum_{j=1}^{J} \pi_j$. We observe $n$ independent
realizations of the random variable $X$ denoted by the pairs $(z_i, x_i)$, $i = 1, 2, \ldots, n$,
where $z_i$ indicates from which population $x_i$ came and $x_i$ is a vector of
measurements. We can define indicators $z_{ij}$ where

$$z_{ij} = \begin{cases} 1 & \text{if } x_i \text{ is from population } j \\ 0 & \text{otherwise.} \end{cases}$$

We also define $\lambda_j = \log(\pi_j/\pi_0)$, $j = 1, 2, \ldots, J$.

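Given these assumptions, we consider two approaches to the estimation
of the logistic parameters $\beta_{0j}$ and $\beta_j' = (\beta_{1j}, \ldots, \beta_{pj})$.

The normal discrimination procedure, based on the full likelihood
$L_1 = \prod_{i=1}^{n} f(z_i, x_i)$, produces maximum likelihood estimates of the normal
parameters $\pi_j$, $\mu_j$ $(j = 1, \ldots, J)$, $\mu_0$ and $\Sigma$. The parameters $\beta_{0j}$ and $\beta_j$ are
then estimated by:

$$\hat\beta_{0j} = \hat\lambda_j - \tfrac{1}{2}\left(\hat\mu_j'\hat\Sigma^{-1}\hat\mu_j - \hat\mu_0'\hat\Sigma^{-1}\hat\mu_0\right), \qquad \hat\beta_j = \hat\Sigma^{-1}(\hat\mu_j - \hat\mu_0).$$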


The logistic regression procedure is based on the conditional likelihood
$L_2 = \prod_{i=1}^{n} f(z_i \mid x_i)$, where, given the values $x_1, \ldots, x_n$, the $z_i$ are
conditionally independent multinomial random variables with parameters

$$\theta_0(x_i) = \Pr(z_i = 0 \mid x_i) = 1 \Big/ \Big[1 + \sum_{k=1}^{J} \exp(\beta_{0k} + \beta_k' x_i)\Big]$$

and

$$\theta_j(x_i) = \Pr(z_i = j \mid x_i) = \exp(\beta_{0j} + \beta_j' x_i) \Big/ \Big[1 + \sum_{k=1}^{J} \exp(\beta_{0k} + \beta_k' x_i)\Big].$$

The maximum likelihood estimates of $\beta_{0j}$ and $\beta_j$ are obtained by maximizing
$L_2$ with respect to $\beta_{0j}$ and $\beta_j$ (see Cox, 1970; Mantel, 1966).
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We express the estimates $oj, and $oj, more concisely using the 

matrix notation B and B, where B = (Bq\B,), 

B' 0 = (Pou---,Poj) and B t = (/3 U ... ,0j). 
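The multinomial probabilities $\theta_j(x_i)$ and the conditional log likelihood $\log L_2$ defined above can be evaluated as in the following sketch. It assumes NumPy, and the names `logistic_probs` and `conditional_loglik` are hypothetical. Slopes are stored row-wise, matching the $J \times p$ layout of $B_*$.

```python
import numpy as np

def logistic_probs(x, beta0, beta):
    """theta_j(x_i) for j = 0..J, with group 0 as the reference.

    x: (n, p); beta0: (J,); beta: (J, p) with row j-1 holding beta_j'.
    Returns an (n, J + 1) array whose rows sum to one.
    """
    eta = np.asarray(beta0) + np.asarray(x) @ np.asarray(beta).T   # (n, J) linear predictors
    eta = np.column_stack([np.zeros(len(eta)), eta])               # reference group: eta = 0
    eta -= eta.max(axis=1, keepdims=True)                          # guard against overflow
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

def conditional_loglik(x, z, beta0, beta):
    """log L_2 = sum over i of log theta_{z_i}(x_i)."""
    theta = logistic_probs(x, beta0, beta)
    z = np.asarray(z)
    return float(np.sum(np.log(theta[np.arange(len(z)), z])))
```

Maximizing `conditional_loglik` over `beta0` and `beta` with any general-purpose optimizer would give the logistic regression estimates; that step is omitted here.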

A comparison of the two approaches to estimation is made in terms of large sample variance. This is appropriate, since the logistic regression and normal discrimination estimators have asymptotic normal distributions (see Theorems 1 and 2). Since we are comparing estimators with variances of order $(1/n)$, we measure the relative estimating efficiency of the logistic regression estimator, $t_2$, to the normal discrimination estimator, $t_1$, by the variance ratio $V(t_1)/V(t_2)$, where $V(t_1)$ and $V(t_2)$ are asymptotic variances. In particular, Theorem 3 gives the relative efficiency of the logistic regression estimator of the parameter $\beta_{11}$, where $\beta_{11}$ is interpreted as the increase in the log odds ratio of an observation being in population 1 relative to population 0, for a unit increase in $x_1$.

We define the following vectors and matrices:
$e_s$ is a $p \times 1$ vector with 1 in the $s$th position and zero elsewhere,
$i_j$ is a $J \times 1$ vector with 1 in the $j$th position and zero elsewhere,
$1_m$ is an $m \times 1$ vector of 1's,
$I_m$ is the $m \times m$ identity matrix,
$0_{m \times n}$ is an $m \times n$ matrix of zeros, and
$\mathrm{diag}[d_1, \ldots, d_m]$ is a diagonal (or block diagonal) matrix constructed from the elements (or matrices) $d_1, \ldots, d_m$.

Also useful are the three matrix operators: $\otimes$, the Kronecker product; vec and vech. The Kronecker product of matrices is defined by $A \otimes B = \{a_{ij}B\}_{pr \times qs}$ $(i = 1, \ldots, p;\ j = 1, \ldots, q)$, where we have $A_{p \times q} = \{a_{ij}\}$ and $B_{r \times s}$. The vec and vech operators, discussed by Henderson and Searle (1979) and by McCulloch (1982), convert matrices into column vectors. The operator $\mathrm{vec}(X)$ takes the columns $x_1, x_2, \ldots, x_q$ of a $p \times q$ matrix $X$ and stacks the columns so that $\mathrm{vec}(X)$ is a vector of length $pq$. For $m \times m$ symmetric $X$, $\mathrm{vech}(X)$ is defined similarly including only the distinct elements of $X$, that is, the elements on or above the diagonal. $\mathrm{vech}(X)$ is thus a column vector of length $\tfrac{1}{2}m(m+1)$. It can also be shown that $\mathrm{vec}(ABC) = (C' \otimes A)\,\mathrm{vec}(B)$.

Three special matrices may be defined in terms of relationships among the column vectors $\mathrm{vec}(X)$, $\mathrm{vec}(X')$ and $\mathrm{vech}(X)$. The $pq \times pq$ vec-permutation matrix $I_{(p,q)}$, defined by $\mathrm{vec}(X_{p \times q}) = I_{(p,q)}\,\mathrm{vec}(X'_{q \times p})$, is related to Kronecker products by the identity $B_{r \times s} \otimes A_{p \times q} = I_{(p,r)}(A_{p \times q} \otimes B_{r \times s})I_{(s,q)}$. For $m \times m$ symmetric $X$, we can define the matrices $G$ and $H$ such that $\mathrm{vec}(X) = G\,\mathrm{vech}(X)$ and $\mathrm{vech}(X) = H\,\mathrm{vec}(X)$. The dimensions of $G$ are $m^2$ by $\tfrac{1}{2}m(m+1)$, while those of $H$ are $\tfrac{1}{2}m(m+1)$ by $m^2$. $H$ is a non-unique left inverse of $G$, one choice being $H = (G'G)^{-1}G'$. Magnus and Neudecker (1979) discussed the definition and results stated here and also reviewed earlier literature on the subject.
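The operators above are easy to check numerically. The following sketch, assuming NumPy, builds vec, vech (elements on or above the diagonal, taken column by column) and a matrix $G$ with $\mathrm{vec}(X) = G\,\mathrm{vech}(X)$. The helper names `vec`, `vech` and `duplication_matrix` are illustrative choices, not notation from the paper.

```python
import numpy as np

def vec(x):
    """Stack the columns of x into one vector."""
    return x.reshape(-1, order="F")

def vech(x):
    """Stack the elements on or above the diagonal, column by column."""
    m = x.shape[0]
    return np.concatenate([x[: j + 1, j] for j in range(m)])

def duplication_matrix(m):
    """G such that vec(X) = G vech(X) for any symmetric m x m matrix X."""
    pairs = [(i, j) for j in range(m) for i in range(j + 1)]   # i <= j, same order as vech
    g = np.zeros((m * m, len(pairs)))
    for col, (i, j) in enumerate(pairs):
        g[j * m + i, col] = 1.0    # position of X[i, j] in vec(X)
        g[i * m + j, col] = 1.0    # position of X[j, i] in vec(X)
    return g
```

Given `G = duplication_matrix(m)`, the left inverse mentioned in the text is `H = np.linalg.inv(G.T @ G) @ G.T`, and the identity $\mathrm{vec}(ABC) = (C' \otimes A)\,\mathrm{vec}(B)$ can be confirmed with `np.kron(C.T, A) @ vec(B)`.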

3. LARGE SAMPLE DISTRIBUTION OF DISCRIMINANT ESTIMATES 

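Lemma 1. Assuming that a random variable $X$ can arise from one of $J + 1$ $p$-dimensional normal populations such that

$$X \sim N_p(\mu_j, \Sigma) \quad \text{with probability } \pi_j \quad (j = 0, 1, 2, \ldots, J),$$

the normal discrimination procedure produces the following maximum likelihood estimates of the normal parameters: $\hat\lambda = \{\hat\lambda_j\} = \{\log(\hat\pi_j/\hat\pi_0)\}$, $\hat\mu_j = \sum_{i=1}^{n} z_{ij}x_i/n_j$, and $\hat\sigma = \mathrm{vech}(\hat\Sigma)$, where $n_j = \sum_{i=1}^{n} z_{ij}$, $\hat\pi_j = n_j/n$, and $\hat\Sigma = \sum_{j=0}^{J}\sum_{i=1}^{n} z_{ij}(x_i - \hat\mu_j)(x_i - \hat\mu_j)'/n$. Using the usual methods based on Fisher's expected information matrix, one can derive asymptotic distributions for the maximum likelihood estimates:

(a) $\mathcal{L}[\sqrt{n}(\hat\lambda - \lambda)] \longrightarrow N_J(0, V_\lambda)$,

(b) $\mathcal{L}[\sqrt{n}(\hat\mu_j - \mu_j)] \longrightarrow N_p(0, V_{\mu_j})$,

(c) $\mathcal{L}[\sqrt{n}(\hat\sigma - \sigma)] \longrightarrow N_q(0, V_\Sigma)$, where $q = \tfrac{1}{2}p(p + 1)$,

with $\hat\lambda$, $\hat\mu_j$ $(j = 0, 1, \ldots, J)$ and $\hat\sigma$ asymptotically uncorrelated, where $V_\lambda = \mathrm{diag}[\pi_1^{-1}, \ldots, \pi_J^{-1}] + \pi_0^{-1}1_J1_J'$, $V_{\mu_j} = \pi_j^{-1}\Sigma$, and $V_\Sigma = 2H(\Sigma \otimes \Sigma)H'$. The matrix $H$ is of order $p(p+1)/2$ by $p^2$ and may be defined as $H = (G'G)^{-1}G'$. McCulloch (1982) has given an analytic definition of $G$.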

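Theorem 1. The logistic parameter estimates, $\hat B_0$ and $\hat B_*$, produced by the normal discrimination procedure have the asymptotic distribution:

$$\mathcal{L}[\sqrt{n}\,\mathrm{vec}(\hat B - B)] \longrightarrow N(0, \Sigma_1(B)),$$

where

$$\frac{1}{n}\Sigma_1(B) = \begin{bmatrix} \mathrm{Var}(\hat B_0) & \mathrm{Cov}(\hat B_0, \mathrm{vec}\hat B_*) \\ \mathrm{Cov}(\mathrm{vec}\hat B_*, \hat B_0) & \mathrm{Var}(\mathrm{vec}\hat B_*) \end{bmatrix}$$

with

$$\mathrm{Var}(\hat B_0) = \frac{1}{n}\Big[ Q + \frac{1}{\pi_0}(1 + \mu_0'\Sigma^{-1}\mu_0)1_J1_J' + S(U'\Sigma^{-1}U \otimes Q)S' + \tfrac{1}{2}S(U'\Sigma^{-1}U \otimes U'\Sigma^{-1}U - U_0'\Sigma^{-1}U_0 \otimes U_0'\Sigma^{-1}U_0)S' \Big],$$

$$\mathrm{Cov}(\hat B_0, \mathrm{vec}\hat B_*) = \frac{1}{n}\Big[ \frac{1}{\pi_0}(\mu_0'\Sigma^{-1} \otimes 1_J1_J') + S(U'\Sigma^{-1} \otimes Q) - S\big(U'\Sigma^{-1} \otimes U'\Sigma^{-1}(U - U_0) - U_0'\Sigma^{-1} \otimes U_0'\Sigma^{-1}(U - U_0)\big) \Big],$$

$$\mathrm{Var}(\mathrm{vec}\hat B_*) = \frac{1}{n}\Big[ \Sigma^{-1} \otimes \Big(Q + \frac{1}{\pi_0}1_J1_J'\Big) + \Sigma^{-1} \otimes (U - U_0)'\Sigma^{-1}(U - U_0) + \big(\Sigma^{-1}(U - U_0) \otimes (U - U_0)'\Sigma^{-1}\big)I_{(J,p)} \Big],$$

and

$$Q = \mathrm{diag}[\pi_1^{-1}, \ldots, \pi_J^{-1}], \qquad U = (\mu_1 \mid \mu_2 \mid \cdots \mid \mu_J), \qquad U_0 = \mu_0 1_J', \qquad S = \sum_{j=1}^{J} i_j(i_j' \otimes i_j').$$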

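Corollary 1.1. The large sample variances and covariances of the logistic parameter estimates, $\hat\beta_{sj}$ $(s = 1, 2, \ldots, p;\ j = 1, 2, \ldots, J)$, produced by the normal discrimination procedure are:

$$\mathrm{Var}(\hat\beta_{sj}) = \frac{1}{n}\Big\{\sigma^{ss}\Big(\frac{1}{\pi_j} + \frac{1}{\pi_0} + (\mu_j - \mu_0)'\Sigma^{-1}(\mu_j - \mu_0)\Big) + (\mu_j - \mu_0)'\Sigma^{-1}e_se_s'\Sigma^{-1}(\mu_j - \mu_0)\Big\},$$

$$\mathrm{Cov}(\hat\beta_{sj}, \hat\beta_{sk}) = \frac{1}{n}\Big\{\sigma^{ss}\Big(\frac{1}{\pi_0} + (\mu_j - \mu_0)'\Sigma^{-1}(\mu_k - \mu_0)\Big) + (\mu_j - \mu_0)'\Sigma^{-1}e_se_s'\Sigma^{-1}(\mu_k - \mu_0)\Big\},$$

$$\mathrm{Cov}(\hat\beta_{sj}, \hat\beta_{tj}) = \frac{1}{n}\Big\{\sigma^{st}\Big(\frac{1}{\pi_j} + \frac{1}{\pi_0} + (\mu_j - \mu_0)'\Sigma^{-1}(\mu_j - \mu_0)\Big) + (\mu_j - \mu_0)'\Sigma^{-1}e_te_s'\Sigma^{-1}(\mu_j - \mu_0)\Big\},$$

where $\sigma^{st} = e_s'\Sigma^{-1}e_t$.

Proof of Theorem 1. The normal discrimination procedure produces estimates of the logistic model parameters that are functions of the normal parameter estimates. From Lemma 1, we have the asymptotic distributions of $\hat\lambda_j$, $\hat\mu_j$ and $\hat\Sigma$. The asymptotic distributions of $\hat\beta_{0j}$, $\hat\beta_j$ can then be obtained by means of the multivariate "delta" method (see Bishop et al., 1975, Theorem 14.6-2):

$$\mathcal{L}[\sqrt{n}\,\mathrm{vec}(\hat B - B)] \longrightarrow N\Big(0,\ \frac{\partial\,\mathrm{vec}B}{\partial\theta'}\,[n\,\mathrm{Var}(\hat\theta)]\,\Big(\frac{\partial\,\mathrm{vec}B}{\partial\theta'}\Big)'\Big),$$

where $\theta' = (\lambda', \mu_0', \mu_*', \sigma')$, $\mu_*' = \mathrm{vec}'U$, and the relationship between $B$ and $\theta$ can be expressed in matrix notation by:

$$B_0 = \lambda - \tfrac{1}{2}S\,\mathrm{vec}[U'\Sigma^{-1}U - U_0'\Sigma^{-1}U_0], \qquad B_* = (U - U_0)'\Sigma^{-1}.$$

Let

$$M = \frac{\partial\,\mathrm{vec}B}{\partial\theta'} = [M_\lambda \mid M_{\mu_0} \mid M_{\mu_*} \mid M_\Sigma]$$

and

$$\mathrm{Var}(\hat\theta) = \frac{1}{n}\mathrm{diag}[V_\lambda, V_{\mu_0}, V_{\mu_*}, V_\Sigma].$$

Then

$$\mathrm{Var}(\mathrm{vec}\hat B) = M[\mathrm{Var}(\hat\theta)]M' = \frac{1}{n}\big[M_\lambda V_\lambda M_\lambda' + M_{\mu_0}V_{\mu_0}M_{\mu_0}' + M_{\mu_*}V_{\mu_*}M_{\mu_*}' + M_\Sigma V_\Sigma M_\Sigma'\big],$$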


where 


M\ 




dvecB 


dX' 

dvecB 


M, 


M. 


M e = 


dvecB 


L°pJxJj 

Ij 


L J 


( 1/0 2 - 1 ), 


dvec'/i* 
dvecB 


hp.j) 


(Ij® s _1 ), 


Svech'E 


-(I P ®{U-U 0 )>) 


$(\Sigma^{-1} \otimes \Sigma^{-1})G$,

and $V_{\mu_*} = \mathrm{diag}[\pi_1^{-1}, \ldots, \pi_J^{-1}] \otimes \Sigma$. The other variances are as stated in Lemma 1. The derivation of the $M$ matrices follows from the matrix calculus results of Dwyer and MacPhail (1948), MacRae (1974) and McCulloch (1982). Details are given by Bull (1983).

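Evaluating $\mathrm{Var}(\mathrm{vec}\hat B)$ we consider:

$$M_\lambda V_\lambda M_\lambda' = \begin{bmatrix} V_\lambda & 0_{J \times pJ} \\ 0_{pJ \times J} & 0_{pJ \times pJ} \end{bmatrix}_{(p+1)J \times (p+1)J} = \begin{bmatrix} 1 & 0_{1 \times p} \\ 0_{p \times 1} & 0_{p \times p} \end{bmatrix} \otimes V_\lambda,$$

$$M_{\mu_0}V_{\mu_0}M_{\mu_0}' = \frac{1}{\pi_0}\begin{bmatrix} (1\,1' \otimes \mu_0'\Sigma^{-1}\mu_0) & (1\,1' \otimes \mu_0'\Sigma^{-1})I_{(J,p)} \\ I_{(p,J)}(1\,1' \otimes \Sigma^{-1}\mu_0) & I_{(p,J)}(1\,1' \otimes \Sigma^{-1})I_{(J,p)} \end{bmatrix} = \frac{1}{\pi_0}\begin{bmatrix} \mu_0'\Sigma^{-1}\mu_0 & \mu_0'\Sigma^{-1} \\ \Sigma^{-1}\mu_0 & \Sigma^{-1} \end{bmatrix} \otimes 1\,1', \quad \text{where } 1 = 1_J,$$

$$M_{\mu_*}V_{\mu_*}M_{\mu_*}' = \begin{bmatrix} S(Q \otimes U'\Sigma^{-1}U)S' & S(Q \otimes U'\Sigma^{-1})I_{(J,p)} \\ I_{(p,J)}(Q \otimes \Sigma^{-1}U)S' & I_{(p,J)}(Q \otimes \Sigma^{-1})I_{(J,p)} \end{bmatrix} = \begin{bmatrix} S(U'\Sigma^{-1}U \otimes Q)S' & S(U'\Sigma^{-1} \otimes Q) \\ (\Sigma^{-1}U \otimes Q)S' & \Sigma^{-1} \otimes Q \end{bmatrix}$$

and

$$M_\Sigma V_\Sigma M_\Sigma' = \begin{bmatrix} (1) & (2) \\ (2)' & (3) \end{bmatrix},$$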

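where

$$(1) = S(U'\Sigma^{-1}U \otimes U'\Sigma^{-1}U - U_0'\Sigma^{-1}U_0 \otimes U_0'\Sigma^{-1}U_0)S'/2,$$

$$(2) = -S\big(U'\Sigma^{-1} \otimes U'\Sigma^{-1}(U - U_0) - U_0'\Sigma^{-1} \otimes U_0'\Sigma^{-1}(U - U_0)\big),$$

$$(3) = \big(\Sigma^{-1} \otimes (U - U_0)'\Sigma^{-1}(U - U_0)\big) + \big(\Sigma^{-1}(U - U_0) \otimes (U - U_0)'\Sigma^{-1}\big)I_{(J,p)}.$$

Therefore,

$$n\,\mathrm{Var}(\hat B_0) = V_\lambda + \frac{\mu_0'\Sigma^{-1}\mu_0}{\pi_0}1\,1' + S(U'\Sigma^{-1}U \otimes Q)S' + (1),$$

$$n\,\mathrm{Cov}(\hat B_0, \mathrm{vec}\hat B_*) = \frac{1}{\pi_0}(\mu_0'\Sigma^{-1} \otimes 1\,1') + S(U'\Sigma^{-1} \otimes Q) + (2),$$

and

$$n\,\mathrm{Var}(\mathrm{vec}\hat B_*) = \frac{1}{\pi_0}\Sigma^{-1} \otimes 1\,1' + \Sigma^{-1} \otimes Q + (3).$$

Furthermore, one obtains the result of Corollary 1.1 by evaluating

$$\mathrm{Var}(\hat\beta_{sj}) = (e_s' \otimes i_j')\mathrm{Var}(\mathrm{vec}\hat B_*)(e_s \otimes i_j),$$

$$\mathrm{Cov}(\hat\beta_{sj}, \hat\beta_{tj}) = (e_s' \otimes i_j')\mathrm{Var}(\mathrm{vec}\hat B_*)(e_t \otimes i_j),$$

and

$$\mathrm{Cov}(\hat\beta_{sj}, \hat\beta_{sk}) = (e_s' \otimes i_j')\mathrm{Var}(\mathrm{vec}\hat B_*)(e_s \otimes i_k).$$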
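As a numerical illustration of the variance in Corollary 1.1, the sketch below evaluates $n\,\mathrm{Var}(\hat\beta_{sj}) = \sigma^{ss}(\pi_j^{-1} + \pi_0^{-1} + \Delta_j^2) + \beta_{sj}^2$, using $\beta_j = \Sigma^{-1}(\mu_j - \mu_0)$ and $\Delta_j^2 = (\mu_j - \mu_0)'\Sigma^{-1}(\mu_j - \mu_0)$. It assumes NumPy; the function name `nd_variance` and its argument layout are hypothetical.

```python
import numpy as np

def nd_variance(s, j, pi, mu, sigma):
    """n * Var(beta_hat_{sj}) under normal discrimination.

    s: covariate index in 1..p; j: group index in 1..J;
    pi: (J + 1,) probabilities; mu: (J + 1, p) means; sigma: (p, p) covariance.
    """
    pi = np.asarray(pi, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma_inv = np.linalg.inv(np.asarray(sigma, dtype=float))
    d = mu[j] - mu[0]
    beta_j = sigma_inv @ d                  # beta_j = Sigma^{-1}(mu_j - mu_0)
    delta2 = d @ beta_j                     # squared Mahalanobis distance Delta_j^2
    s_ss = sigma_inv[s - 1, s - 1]          # sigma^{ss}
    return s_ss * (1.0 / pi[j] + 1.0 / pi[0] + delta2) + beta_j[s - 1] ** 2
```

For three equally likely groups with identity covariance and $\mu_1 - \mu_0 = (1, 0)'$, the expression reduces to $1 \cdot (3 + 3 + 1) + 1 = 8$.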


4. LARGE SAMPLE DISTRIBUTION OF LOGISTIC 
REGRESSION ESTIMATES 

In finding the large sample distribution of the logistic regression estimates, it is easier to work in the so-called standard or canonical situation in which the distributions are in a simpler form. The standard situation is specified by standard population mean vectors which are related to general population mean vectors by a linear, possibly non-homogeneous, transformation represented by the matrix $A$. The asymptotic variance of an estimator in the general model can then be expressed as a function of the variance-covariance matrix of the estimators in the standard situation; this is stated without proof in Lemma 2. Lemma 3 defines precisely the standard situation and the transformation matrix $A$ for any particular case. We show that $A$ is a function of the population means and variances in the general model.

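Lemma 2. We assume that a random variable $X$ can arise from one of $J + 1$ $p$-dimensional normal populations such that

$$X \sim N_p(\mu_j, \Sigma_x) \quad \text{with probability } \pi_j \quad (j = 0, 1, \ldots, J).$$

By means of a linear transformation $Y = AX + a$, we obtain the random variable

$$Y \sim N_p(\nu_j, \Sigma_y) \quad \text{with probability } \pi_j,$$

where $\nu_j = A\mu_j + a$ and $\Sigma_y = A\Sigma_xA'$.

If $\tilde B_0$, $\tilde B_*$ are the parameter estimates produced by the logistic regression procedure and based on the original random variable $X$, and $\tilde G_0$, $\tilde G_*$ are the corresponding estimates based on the random variable $Y$, then it can be shown that

$$\tilde B_0 = \tilde G_0 + (a' \otimes I_J)\mathrm{vec}\tilde G_* \quad \text{and} \quad \mathrm{vec}\tilde B_* = (A' \otimes I_J)\mathrm{vec}\tilde G_*.$$

It follows that

$$\mathrm{Var}(\tilde B_0) = \mathrm{Var}(\tilde G_0) + (a' \otimes I_J)\mathrm{Var}(\mathrm{vec}\tilde G_*)(a \otimes I_J) + (a' \otimes I_J)\mathrm{Cov}(\mathrm{vec}\tilde G_*, \tilde G_0) + \mathrm{Cov}(\tilde G_0, \mathrm{vec}\tilde G_*)(a \otimes I_J),$$

$$\mathrm{Var}(\mathrm{vec}\tilde B_*) = (A' \otimes I_J)\mathrm{Var}(\mathrm{vec}\tilde G_*)(A \otimes I_J)$$

and

$$\mathrm{Cov}(\tilde B_0, \mathrm{vec}\tilde B_*) = (a' \otimes I_J)\mathrm{Var}(\mathrm{vec}\tilde G_*)(A \otimes I_J) + \mathrm{Cov}(\tilde G_0, \mathrm{vec}\tilde G_*)(A \otimes I_J).$$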
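The parameter mapping in Lemma 2 can be checked directly: a logistic model fitted on $Y = AX + a$ gives the same linear predictors as the mapped parameters applied to $X$. The sketch below assumes NumPy; the function name `to_original_scale` is hypothetical, and the slope matrix is stored as $J \times p$ so that vec stacks its $p$ columns of length $J$.

```python
import numpy as np

def to_original_scale(g0, g_star, a_mat, a_vec):
    """Map logistic parameters fitted on Y = A X + a back to the X scale.

    g0: (J,) intercepts; g_star: (J, p) slopes (row j-1 is the slope for group j);
    a_mat: (p, p) matrix A; a_vec: (p,) shift a.
    """
    j_groups = len(g0)
    vec_g = g_star.reshape(-1, order="F")                         # vec stacks columns
    b0 = g0 + np.kron(a_vec[None, :], np.eye(j_groups)) @ vec_g   # (a' kron I_J) vec G*
    vec_b = np.kron(a_mat.T, np.eye(j_groups)) @ vec_g            # (A' kron I_J) vec G*
    b_star = vec_b.reshape(g_star.shape, order="F")
    return b0, b_star
```

In matrix form the same mapping is simply `b_star = g_star @ a_mat` and `b0 = g0 + g_star @ a_vec`; the Kronecker version is written out only to mirror the statement of the lemma.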

Lemma 3. We obtain the standard situation in which $Y \sim N_p(\nu_j, I_p)$ with probability $\pi_j$, by the linear transformation $Y = AX + a$.

The transformation takes the population means $\mu_0, \ldots, \mu_J$ in the metric $\Sigma^{-1}$ in $p$-space into the corresponding population means $\nu_0, \ldots, \nu_J$ in the metric $I$. This is equivalent to a change of coordinates from an affine co-ordinate system in $X_1, \ldots, X_p$ to an orthogonal co-ordinate system in $Y_1, \ldots, Y_p$ involving:

(i) translation of the origin to the point $\mu_0$,

(ii) change of axes from an oblique orientation (i.e., the axes $X_s$ and $X_t$ are not perpendicular) to the usual cartesian orientation (the axes $Y_s$ and $Y_t$ are perpendicular), and

(iii) rotation of axes around the origin so that $\nu_1$ is on the $Y_1$ axis, $\nu_2$ is in the $Y_1 \times Y_2$ plane, $\nu_3$ is in the $Y_1 \times Y_2 \times Y_3$ plane, and similarly.

We then have $\nu_0 = 0$ and $\nu_j = Vi_j$ $(j = 1, 2, \ldots, J)$, the $j$th column of the $p \times J$ matrix $V$. $V$ is such that $\nu_1, \nu_2, \ldots, \nu_m$ comprise a set of linearly independent vectors and $\nu_{m+1}, \ldots, \nu_J$ are each a linear combination of $\nu_1, \ldots, \nu_m$. Here $m$, the dimension of the space spanned by $\nu_1, \ldots, \nu_J$, is less than or equal to the minimum of $J$ and $p$.

The matrix V may be expressed in partitioned form as: 

v =\ n VllVi 1, with V l = {v ii y, V t = {v ik }, 

. u (p—m) X J . 

where 

v .. _ J A, sin HUi cos * < 3 

1 ° *> 3, {*,3 = 1 , 2 , 

v - J A fc sintfj_i ( jfe UZ1 1 cos 9ik 1 < t < m - 1 
* \ A*sin0<_i >fc * = m, (fc = m+ 1,..J), 
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and 

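$$\Delta_j = [(\mu_j - \mu_0)'\Sigma^{-1}(\mu_j - \mu_0)]^{1/2},$$

$$\cos\theta_{ij} = (\mu_i - \mu_0)'\Sigma^{-1}(\mu_j - \mu_0)/\Delta_i\Delta_j,$$

and

$$\sin\theta_{ij} = (1 - \cos^2\theta_{ij})^{1/2}.$$

It follows that $\nu_j'\nu_j = \Delta_j^2$ $(j = 1, 2, \ldots, J)$. The transformation to $Y$ is specified by $A_{p \times p}$ and $a_{p \times 1}$:

$$A = \begin{bmatrix} A_a \\ A_b \end{bmatrix} = \begin{bmatrix} (V_1')^{-1}(\Sigma^{-1}U_a)' \\ (\Sigma^{-1}U_b)' \end{bmatrix}, \qquad a = -\begin{bmatrix} A_a\mu_0 \\ A_b\mu_0 \end{bmatrix},$$

where $U_a = (\mu_1 \mid \cdots \mid \mu_m) - \mu_01_m'$, and $U_b$ is constructed such that $U_a'\Sigma^{-1}U_b = 0_{m \times (p-m)}$ and $U_b'\Sigma^{-1}U_b = I_{p-m}$.
Note that $A'A = \Sigma^{-1}$ implies that $A_a'A_a + A_b'A_b = \Sigma^{-1}$.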
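A transformation with the properties required by Lemma 3 can also be obtained numerically. The sketch below is not the explicit trigonometric construction of the lemma: as a labelled assumption, it substitutes Cholesky whitening followed by a QR rotation, which yields a matrix $A$ with $A\Sigma A' = I_p$, $A'A = \Sigma^{-1}$, $\nu_0 = 0$, and $V$ upper triangular with non-negative leading entries. It assumes NumPy and full-rank mean differences; the function name `standardize` is hypothetical.

```python
import numpy as np

def standardize(mu, sigma):
    """Return (A, a) with A Sigma A' = I, A mu_0 + a = 0, and V = A(U - U_0) upper triangular.

    mu: (J + 1, p) population means, row 0 being the reference group; sigma: (p, p).
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    chol = np.linalg.cholesky(sigma)              # sigma = chol @ chol.T
    whiten = np.linalg.inv(chol)                  # whiten @ sigma @ whiten.T = I
    d = whiten @ (mu[1:] - mu[0]).T               # (p, J) whitened mean differences
    q, r = np.linalg.qr(d, mode="complete")       # d = q @ r with r upper triangular
    signs = np.ones(p)
    k = min(r.shape)
    signs[:k] = np.where(np.diag(r)[:k] < 0, -1.0, 1.0)   # flip axes so leading entries are >= 0
    a_mat = (q * signs).T @ whiten
    a_vec = -a_mat @ mu[0]
    return a_mat, a_vec
```

The columns of `a_mat @ mu[1:].T + a_vec[:, None]` are then the standardized means $\nu_j$, whose lengths equal the Mahalanobis distances $\Delta_j$.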


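Theorem 2. The logistic parameter estimates, $\tilde B_0$ and $\tilde B_*$, produced by the logistic regression procedure have the asymptotic distribution:

$$\mathcal{L}[\sqrt{n}\,\mathrm{vec}(\tilde B - B)] \longrightarrow N(0, \Sigma_2(B)),$$

where

$$\frac{1}{n}\Sigma_2(B) = \begin{bmatrix} \mathrm{Var}(\tilde B_0) & \mathrm{Cov}(\tilde B_0, \mathrm{vec}\tilde B_*) \\ \mathrm{Cov}(\mathrm{vec}\tilde B_*, \tilde B_0) & \mathrm{Var}(\mathrm{vec}\tilde B_*) \end{bmatrix},$$

$$\mathrm{Var}(\tilde B_0) = \frac{1}{n}\big\{(a_a' \otimes I_J - \Omega_{00}^{-1}B_2)E_3^{-1}(a_a \otimes I_J - B_2'\Omega_{00}^{-1}) + (1 + a_b'a_b)\Omega_{00}^{-1}\big\},$$

$$\mathrm{Cov}(\tilde B_0, \mathrm{vec}\tilde B_*) = \frac{1}{n}\big\{(a_a' \otimes I_J - \Omega_{00}^{-1}B_2)E_3^{-1}(A_a \otimes I_J) + (a_b'A_b \otimes \Omega_{00}^{-1})\big\},$$

$$\mathrm{Var}(\mathrm{vec}\tilde B_*) = \frac{1}{n}\big\{(A_a' \otimes I_J)E_3^{-1}(A_a \otimes I_J) + A_b'A_b \otimes \Omega_{00}^{-1}\big\},$$

with

$$E_3 = B_3 - B_2'\Omega_{00}^{-1}B_2,$$

$$B_2 = (\Omega_{01} \mid \Omega_{02} \mid \cdots \mid \Omega_{0m})$$

and

$$B_3 = \{\Omega_{st}\} \quad (s, t = 1, 2, \ldots, m),$$

where the $\Omega_{st}$ are $J \times J$ submatrices of the Fisher information matrix evaluated in the standard situation, and $a_a$, $a_b$ denote the first $m$ and last $p - m$ elements of $a$.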

Corollary 2.1. The large sample variances and covariances of the logistic parameter estimates, $\tilde\beta_{sj}$ $(s = 1, 2, \ldots, p;\ j = 1, 2, \ldots, J)$, produced by the logistic regression procedure are:

$$\mathrm{Var}(\tilde\beta_{sj}) = \frac{1}{n}\big\{(a_s' \otimes i_j')E_3^{-1}(a_s \otimes i_j) + (\sigma^{ss} - a_s'a_s)(i_j'\Omega_{00}^{-1}i_j)\big\},$$

$$\mathrm{Cov}(\tilde\beta_{sj}, \tilde\beta_{sk}) = \frac{1}{n}\big\{(a_s' \otimes i_j')E_3^{-1}(a_s \otimes i_k) + (\sigma^{ss} - a_s'a_s)(i_j'\Omega_{00}^{-1}i_k)\big\},$$

$$\mathrm{Cov}(\tilde\beta_{sj}, \tilde\beta_{tj}) = \frac{1}{n}\big\{(a_s' \otimes i_j')E_3^{-1}(a_t \otimes i_j) + (\sigma^{st} - a_s'a_t)(i_j'\Omega_{00}^{-1}i_j)\big\},$$

where $a_s = A_ae_s$ and $\Sigma^{-1} = \{\sigma^{st}\}$, $s, t = 1, 2, \ldots, p$.

Proof of Theorem 2. In contravention of the usual maximum likelihood theory assumptions, the multinomial random variables $z_i$ are not identically distributed. Rather, the distribution of each $z_i$ is parameterized by $\theta_j(x_i)$ $(j = 1, 2, \ldots, J)$ which depend on $x_i$. However, the distributions are related in that the logistic parameters $\beta_{0j}$ and $\beta_j$ $(j = 1, 2, \ldots, J)$ are shared.

This situation, involving related, or associated, populations was considered by Bradley and Gart (1962). They give conditions under which the usual properties of consistency and asymptotic normality hold for maximum likelihood estimators from associated populations. Hauck (1981) adopted assumptions more restrictive than those of Bradley and Gart in order to convert the associated populations situation into the identically distributed situation. Usual maximum likelihood theory can then be applied.

Of interest here is the case in which the number of possible values of $x$ is infinite. We therefore assume:

(i) the pairs $(z_i, x_i)$ are independent and identically distributed with density $g(z, x) = f(z \mid x)h(x)$, where

$$f(z \mid x) = \prod_{j=0}^{J} [\theta_j(x)]^{z_j};$$

(ii) $h(x)$, the marginal density of $x$, does not depend on any of the logistic parameters and is given by

$$h(x) = \sum_{j=0}^{J} \frac{\pi_j}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\{-\tfrac{1}{2}(x - \mu_j)'\Sigma^{-1}(x - \mu_j)\};$$

(iii) $g(z, x)$ satisfies all the conditions required for consistency and asymptotic normality of maximum likelihood estimation (see Cox and Hinkley, 1974, Chapter 9).

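Following Hauck's derivation, the log likelihood is then

$$\ell(B, \gamma) = \sum_{i=1}^{n} \ln\{g(z_i, x_i)\} = \sum_{i=1}^{n} \{\ln f(z_i \mid x_i) + \ln h(x_i)\},$$

where $\gamma$ denotes the parameters of $h(x)$. By assumption (ii), we have:

$$\frac{\partial\ell(B, \gamma)}{\partial\,\mathrm{vec}B} = \frac{\partial}{\partial\,\mathrm{vec}B}\Big\{\sum_{i=1}^{n} \ln f(z_i \mid x_i)\Big\},$$

$$\frac{\partial^2\ell(B, \gamma)}{\partial\,\mathrm{vec}B\,\partial\gamma'} = 0,$$

and

$$\frac{\partial^2\ell(B, \gamma)}{\partial\,\mathrm{vec}B\,\partial\,\mathrm{vec}'B} = \sum_{i=1}^{n} \frac{\partial^2}{\partial\,\mathrm{vec}B\,\partial\,\mathrm{vec}'B}\{\ln f(z_i \mid x_i)\}.$$

Thus, $\mathrm{vec}\tilde B$ is asymptotically independent of $\hat\gamma$ with distribution such that

$$\mathcal{L}[\sqrt{n}(\mathrm{vec}\tilde B - \mathrm{vec}B)] \longrightarrow N_{J(p+1)}(0, \Sigma_2(B)),$$

where

$$\Sigma_2^{-1} = \frac{1}{n}E_{g(z,x)}\Big\{\frac{-\partial^2\ell(B, \gamma)}{\partial\,\mathrm{vec}B\,\partial\,\mathrm{vec}'B}\Big\} = E_{h(x)}\Big\{\frac{-\partial^2\ln f(z \mid x)}{\partial\,\mathrm{vec}B\,\partial\,\mathrm{vec}'B}\Big\}.$$

This follows from

$$E_{z,x}\Big\{\sum_{i=1}^{n} w(x_i)\Big\} = E_x\Big[\sum_{i=1}^{n} E_{z|x}(w(x_i))\Big] = E_x\Big[\sum_{i=1}^{n} w(x_i)\Big] = nE_x[w(x)],$$

where $w(x)$ is a function of $x$ and $B$ only.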
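Let

$$\Omega = E_x\Big\{\begin{pmatrix} 1 \\ x \end{pmatrix}(1\ \ x') \otimes P(x)\Big\} = \int \begin{pmatrix} 1 \\ x \end{pmatrix}(1\ \ x') \otimes P(x)h(x)\,dx,$$

where $P(x) = P_D(x) - P_V(x) = \mathrm{diag}[\theta_1(x), \ldots, \theta_J(x)] - \theta'\theta$ with $\theta = (\theta_1(x), \ldots, \theta_J(x))$. Furthermore, $\Omega = \{\Omega_{st}\}$ $(s, t = 0, 1, 2, \ldots, p)$, where each $\Omega_{st}$ is a $J \times J$ matrix:

$$\Omega_{st} = \int x_sx_tP(x)h(x)\,dx, \quad \text{with } x_0 = 1.$$

In the standard situation, as described in Lemma 3, we have $\Sigma_y = I_p$, $\nu_0 = 0$, $\beta_{0k} = \log(\pi_k/\pi_0) - \tfrac{1}{2}\Delta_k^2$ and $\beta_k = \nu_k$, where $\Delta_k^2 = \nu_k'\nu_k$ and $V = (\nu_1 \mid \nu_2 \mid \cdots \mid \nu_J)$ is a $p \times J$ matrix with entries $v_{sj} = 0$ whenever $s > j$ or $s > m$. Usually $m$ will be equal to the minimum of $J$ and $p$, but if the population means $\nu_0, \ldots, \nu_J$ lie in an $r$-dimensional subspace of $p$-space, then $m = r$.

Thus,

$$\theta_j(y) = \frac{\exp\{\ln(\pi_j/\pi_0) - \tfrac{1}{2}\Delta_j^2 + \nu_j'y\}}{1 + \sum_{k=1}^{J}\exp\{\ln(\pi_k/\pi_0) - \tfrac{1}{2}\Delta_k^2 + \nu_k'y\}}$$

and

$$\theta_j(y)h(y) = \pi_j\exp\{-\tfrac{1}{2}\Delta_j^2 + \nu_j'y\}\,g(y),$$

where $g(y) = (2\pi)^{-p/2}\exp\{-\tfrac{1}{2}y'y\}$.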

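Consider $\int y_sy_tP_D(y)h(y)\,dy$, which is a $J \times J$ diagonal matrix with diagonal terms:

$$\int y_sy_t\theta_j(y)h(y)\,dy = \pi_j(2\pi)^{-p/2}\int (y'C_{st}y + y'\mathbf{c}_{st} + c_{st})\exp\{-\tfrac{1}{2}y'y + y'\nu_j - \tfrac{1}{2}\Delta_j^2\}\,dy,$$

where

$$C_{st} = \begin{cases} e_se_t' & s, t \geq 1 \\ 0_{p \times p} & \text{otherwise,} \end{cases} \qquad \mathbf{c}_{st} = \begin{cases} e_s & s \geq 1,\ t = 0 \\ e_t & t \geq 1,\ s = 0 \\ 0 & \text{otherwise,} \end{cases} \qquad c_{st} = \begin{cases} 1 & s = t = 0 \\ 0 & \text{otherwise.} \end{cases}$$

Then, by Graybill (1969, Theorem 10.5.1), we obtain

$$\pi_j[\mathrm{tr}(C_{st}) + \nu_j'C_{st}\nu_j + \nu_j'\mathbf{c}_{st} + c_{st}].$$

However,

$$\mathrm{tr}(C_{st}) = \begin{cases} 1 & s = t \geq 1 \\ 0 & \text{otherwise} \end{cases}$$

and

$$e_s'\nu_j = \begin{cases} v_{sj} & 1 \leq s \leq m \\ 0 & s > m. \end{cases}$$

Therefore,

$$\int y_sy_tP_D(y)h(y)\,dy = \begin{cases} R & s = t = 0 \\ RV_sV_t & 0 \leq s \text{ and } t \leq m,\ s \neq t \\ R + RV_s^2 & 0 < s \text{ and } t \leq m,\ s = t \\ 0_{J \times J} & m < s \text{ or } t,\ s \neq t \\ R & m < s,\ s = t, \end{cases}$$

where $R = \mathrm{diag}[\pi_1, \ldots, \pi_J]$, $V_0 = I_J$, $V_s = \mathrm{diag}[v_{s1}, v_{s2}, \ldots, v_{sJ}]$.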

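Secondly, consider $\int y_sy_tP_V(y)h(y)\,dy$, which is a $J \times J$ symmetric matrix with terms

$$\int y_sy_t\theta_j(y)\theta_k(y)h(y)\,dy \quad (j, k = 1, \ldots, J)$$

$$= \pi_j\pi_k\exp\{-\tfrac{1}{2}(\Delta_j^2 + \Delta_k^2)\}\int y_sy_t\exp\{y'(\nu_j + \nu_k)\}f(y)g(y)\,dy,$$

where $f(y) = [\pi_0 + \sum_{i=1}^{J}\pi_i\exp\{y'\nu_i - \tfrac{1}{2}\Delta_i^2\}]^{-1}$.

Because of the structure of $V$, $f(y)$ will be a function of $y_1, y_2, \ldots, y_m$ ($m$ = dimension of subspace), and $\exp\{y'(\nu_j + \nu_k)\}$ will be a function of $y_1, y_2, \ldots, y_n$ ($n = \min\{m, \max(j, k)\}$).

S. B. BULL AND A. DONNER

Thus,

$$\int y_sy_t\exp\{y'(\nu_j + \nu_k)\}f(y)g(y)\,dy = \int_{y_1}\cdots\int_{y_m}\exp\{y'(\nu_j + \nu_k)\}f(y)(2\pi)^{-p/2}\exp\Big\{-\tfrac{1}{2}\sum_{i=1}^{m}y_i^2\Big\}H(y)\,dy_1 \ldots dy_m$$

and

$$H(y) = \int_{y_{m+1}}\cdots\int_{y_p} y_sy_t\exp\Big\{-\tfrac{1}{2}\sum_{i=m+1}^{p}y_i^2\Big\}\,dy_{m+1} \ldots dy_p = \begin{cases} y_sy_t(2\pi)^{(p-m)/2} & 0 \leq s \text{ and } t \leq m \\ 0 & s > m \text{ or } t > m,\ s \neq t \\ (2\pi)^{(p-m)/2} & s = t > m \\ 1 & m = p. \end{cases}$$

Therefore,

$$\int y_sy_tP_V(y)h(y)\,dy = \begin{cases} RDA_{st}D'R' & 0 \leq s \text{ and } t \leq m \\ 0_{J \times J} & m < s \text{ or } t,\ s \neq t \\ RDA_{00}D'R' & m < s = t, \end{cases}$$

where $D = \mathrm{diag}[\exp(-\Delta_1^2/2), \ldots, \exp(-\Delta_J^2/2)]$ and $A_{st}$ is a $J \times J$ matrix with entries $A_{st}(j, k) = \int y_sy_t\exp\{y'(\nu_j + \nu_k)\}f(y)Q(y)\,dy_1 \ldots dy_m$ and now $y' = (y_1, y_2, \ldots, y_m)$, $y_0 = 1$, and $Q(y) = (2\pi)^{-m/2}\exp\{-\tfrac{1}{2}y'y\}$.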

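Combining the two integral expressions, we obtain the following expression for $\Omega_{st}$:

$$\Omega_{st} = \begin{cases} R - RDA_{00}D'R' & s = t = 0 \\ RV_sV_t - RDA_{st}D'R' & 0 \leq s \text{ and } t \leq m,\ s \neq t \\ R + RV_s^2 - RDA_{ss}D'R' & 0 < s \leq m,\ s = t \\ 0_{J \times J} & m < s \text{ or } t,\ s \neq t \\ R - RDA_{00}D'R' & m < s,\ s = t. \end{cases}$$

Therefore,

$$\Omega = \begin{bmatrix} B_1 & 0_{(m+1)J \times (p-m)J} \\ 0_{(p-m)J \times (m+1)J} & I_{p-m} \otimes \Omega_{00} \end{bmatrix}$$

and inverting,

$$\Omega^{-1} = \begin{bmatrix} B_1^{-1} & 0 \\ 0 & I_{p-m} \otimes \Omega_{00}^{-1} \end{bmatrix},$$

where

$$B_1 = \begin{bmatrix} \Omega_{00} & B_2 \\ B_2' & B_3 \end{bmatrix}, \qquad B_1^{-1} = \begin{bmatrix} \Omega_{00}^{-1} + \Omega_{00}^{-1}B_2E_3^{-1}B_2'\Omega_{00}^{-1} & -\Omega_{00}^{-1}B_2E_3^{-1} \\ -E_3^{-1}B_2'\Omega_{00}^{-1} & E_3^{-1} \end{bmatrix},$$

$B_2 = (\Omega_{01} \mid \cdots \mid \Omega_{0m})$, $B_3 = \{\Omega_{st}\}$, $s, t = 1, 2, \ldots, m$, and $E_3 = [B_3 - B_2'\Omega_{00}^{-1}B_2]$ by Graybill (1969, Theorem 8.2.1).
Then,

$$\mathrm{Var}(\tilde G_0) = \frac{1}{n}\big(\Omega_{00}^{-1} + \Omega_{00}^{-1}B_2E_3^{-1}B_2'\Omega_{00}^{-1}\big),$$

$$\mathrm{Cov}(\tilde G_0, \mathrm{vec}\tilde G_*) = -\frac{1}{n}\big[\Omega_{00}^{-1}B_2E_3^{-1} \mid 0_{J \times (p-m)J}\big],$$

and

$$\mathrm{Var}(\mathrm{vec}\tilde G_*) = \frac{1}{n}\begin{bmatrix} E_3^{-1} & 0 \\ 0 & I_{p-m} \otimes \Omega_{00}^{-1} \end{bmatrix},$$

where $\tilde G_0$ and $\tilde G_*$ are the estimates obtained in the standard situation. Applying Lemma 2 and partitioning $A$ and $a$ as in Lemma 3, we obtain the expressions given in Theorem 2. Corollary 2.1 follows from substitution of $\Sigma^{-1} - A_a'A_a = A_b'A_b$.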
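The partitioned inverse used for $B_1$ can be verified numerically. The sketch below assumes NumPy and a symmetric positive definite $B_1$; the function name `partitioned_inverse` is hypothetical. It forms the Schur complement $E_3 = B_3 - B_2'\Omega_{00}^{-1}B_2$ and assembles the four blocks of $B_1^{-1}$.

```python
import numpy as np

def partitioned_inverse(omega00, b2, b3):
    """Inverse of B_1 = [[omega00, b2], [b2', b3]] via the Schur complement E_3."""
    o_inv = np.linalg.inv(omega00)
    e3 = b3 - b2.T @ o_inv @ b2                  # E_3 = B_3 - B_2' Omega_00^{-1} B_2
    e3_inv = np.linalg.inv(e3)
    top_left = o_inv + o_inv @ b2 @ e3_inv @ b2.T @ o_inv
    top_right = -o_inv @ b2 @ e3_inv
    # The lower-left block is the transpose of the upper-right one because B_1 is symmetric.
    return np.block([[top_left, top_right], [top_right.T, e3_inv]])
```

The result should agree with a direct call to `np.linalg.inv` on the assembled matrix; the partitioned form is useful here because only $\Omega_{00}^{-1}$ and $E_3^{-1}$ appear in the variance expressions.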


5. RELATIVE EFFICIENCY 

Theorem 3. Following Kendall and Stuart (1979, §17.29), we define large sample relative efficiency of estimation by the ratio of $\mathrm{Var}(\hat\beta_{sj})$ to $\mathrm{Var}(\tilde\beta_{sj})$. Therefore, based on the results of Theorems 1 and 2:

$$\mathrm{RE} = \frac{\sigma^{ss}(\pi_0^{-1} + \pi_j^{-1} + \Delta_j^2) + \beta_{sj}^2}{(a_s' \otimes i_j')E_3^{-1}(a_s \otimes i_j) + (\sigma^{ss} - a_s'a_s)(i_j'\Omega_{00}^{-1}i_j)},$$

where $a_s = (V_1')^{-1}B_A'e_s$, $B_A = (\beta_1 \mid \cdots \mid \beta_m)$, $e_s'B_A = (\beta_{s1}, \ldots, \beta_{sm})$, and hence $a_s'a_s = e_s'B_A(B_A'\Sigma B_A)^{-1}B_A'e_s$. We can rewrite $\Delta_j^2 = (\mu_j - \mu_0)'\Sigma^{-1}(\mu_j - \mu_0)$ as $\Delta_j^2 = \beta_j'\Sigma\beta_j$. $V_1$ is as defined in Lemma 3, $E_3$ and $\Omega_{00}$ are as defined in Theorem 2. Furthermore, $m$ is the dimension of the subspace spanned by the vectors $\beta_1, \beta_2, \ldots, \beta_J$. When $m = p$, $a_s'a_s = \sigma^{ss}$ and the denominator of RE reduces to the first term.

Proof. The result follows from Corollaries 1.1 and 2.1 by making the substitution $(\mu_j - \mu_0) = \Sigma\beta_j$.

Corollary 3.1. For the case considered by Efron (1975), $J = 1$ and $p$ arbitrary, we have:

$$\mathrm{RE} = \frac{\sigma^{ss}(\pi_0^{-1} + \pi_1^{-1} + \Delta^2) + \beta_s^2}{a_s^2E_3^{-1} + (\sigma^{ss} - a_s^2)\Omega_{00}^{-1}}$$

with $a_s^2 = \beta_s^2/\Delta^2$ and $\Delta^2 = \beta'\Sigma\beta$. When $p = 1$, $a_s^2 = \sigma^{ss}$.

According to Kendall and Stuart (1979, §25.13), the relative efficiency evaluated at $\beta_s = 0$ is precisely equal to the asymptotic relative efficiency (ARE). Therefore,

$$\mathrm{ARE} = \Omega_{00}(\pi_0^{-1} + \pi_1^{-1} + \Delta^2) = (1 + \pi_0\pi_1\Delta^2)\Omega_{00}/\pi_0\pi_1.$$

It can be shown, from the definition of $\Omega_{00}$ given in the proof of Theorem 2, that when $J = 1$:

$$\Omega_{00} = \frac{\pi_0\pi_1}{(2\pi)^{1/2}}\exp\{-\Delta^2/8\}\int \frac{\exp\{-z^2/2\}}{\pi_0\exp\{-\Delta z/2\} + \pi_1\exp\{\Delta z/2\}}\,dz.$$

Substituting for $\Omega_{00}$ in the expression for ARE produces the asymptotic relative efficiency of logistic regression to normal discrimination for estimating the angle of the discriminant boundary derived by Efron (1975).
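The two-group ARE can be evaluated with a short numerical integration. The sketch below assumes NumPy and replaces Gaussian quadrature with a plain sum over a fine, wide grid, which is an implementation choice of this sketch rather than the method used in the paper. The names `omega00` and `are`, and the grid settings, are hypothetical.

```python
import numpy as np

def omega00(delta, pi1, half_width=12.0, n_grid=48001):
    """Omega_00 for J = 1, integrating over z on a uniform grid."""
    pi0 = 1.0 - pi1
    z = np.linspace(-half_width, half_width, n_grid)
    dz = z[1] - z[0]
    integrand = np.exp(-z**2 / 2) / (pi0 * np.exp(-delta * z / 2) + pi1 * np.exp(delta * z / 2))
    integral = np.sum(integrand) * dz      # integrand is negligible at the grid ends
    return pi0 * pi1 * np.exp(-delta**2 / 8) * integral / np.sqrt(2 * np.pi)

def are(delta, pi1):
    """ARE = (1 + pi0 * pi1 * delta^2) * Omega_00 / (pi0 * pi1)."""
    pi0 = 1.0 - pi1
    return (1.0 + pi0 * pi1 * delta**2) * omega00(delta, pi1) / (pi0 * pi1)
```

At $\Delta = 0$ the integral equals $(2\pi)^{1/2}$, so $\Omega_{00} = \pi_0\pi_1$ and the ARE is exactly one; the ARE then falls as $\Delta$ grows.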

A FORTRAN program was developed to evaluate the relative estimating efficiency of the logistic regression estimate of $\beta_{11}$ for two or three response groups and two explanatory variables. The program was run on the CDC Cyber 170 computer system, using International Mathematical and Statistical Libraries, Edition 8, subroutines. The subroutine DMLIN, a Gaussian numerical integration method, evaluates the entries in the $J$ by $J$ symmetric $\Omega_{st}$ matrices. These matrices are then manipulated using theory for partitioned matrices (Graybill, 1969). Matrix inversion is done with subroutine LINV2P, which invokes iterative improvement until machine accuracy is reached.

We consider two configurations of the explanatory variables, assuming them to be standardized and uncorrelated (that is, the variance-covariance matrix equal to the identity matrix). In the first case, $X_1$, but not $X_2$, is associated with a response in group 1 and in group 2. This corresponds to $\beta_1' = (\beta_{11}, 0)$ and $\beta_2' = (\beta_{12}, 0)$. In the second case, $X_1$ is associated with a response in group 1 but not in group 2, while $X_2$ is associated with a response in group 2 but not in group 1. This corresponds to $\beta_1' = (\beta_{11}, 0)$ and $\beta_2' = (0, \beta_{22})$. Without loss of generality, we evaluate RE for $\beta_{11}$, the log odds ratio that compares group 1 to the reference population with respect to $X_1$, the first explanatory variable.

Figure 1a illustrates the behaviour of RE as a function of $\beta_1$ and $\beta_2$ for the first case. The population response probabilities are assumed to be equal, so that $\pi_j$ is equal to one-third for all three groups. When $\beta_{11}$ is equal to zero, RE is equal to 1.00 regardless of the value of $\beta_{12}$. When $\beta_{12}$ is equal to zero, RE declines monotonically as $\beta_{11}$ increases, dropping below 0.80 when $\beta_{11}$ is 2.0. However, for non-zero values of $\beta_{12}$, the rate of decline in RE depends on $\beta_{12}$ and is not necessarily monotonic. RE decline is slowest for parameter combinations such that $\beta_{12}$ is equal to one-half of $\beta_{11}$.

When the population response probabilities are not equal in the first case, the effect of $\beta_{12}$ on the rate of decline in RE with $\beta_{11}$ is much weaker. Figure 1b displays RE for $\pi_1$ and $\pi_2$ both equal to one-tenth. RE will only be greater than 0.80 provided $\beta_{11}$ is less than 1.5.

Figure 2 indicates that RE is consistently lower in the second case, described above, than it is in the first case. The rate of decline in RE with $\beta_{11}$ increases gradually as $\beta_{22}$ increases for both equal and unequal $\pi_j$'s. There is virtually no effect due to $\beta_{22}$ for unequal $\pi_j$'s.

6. SUMMARY AND CONCLUSIONS 

When there are three response groups involved and the response probabilities are equal, RE at $\beta_{11}$ equal to 3.0 ranges from 46 to 88 percent in the first case and from 38 to 46 percent in the second case, for the group 2 parameters (either $\beta_{12}$ or $\beta_{22}$ respectively) considered. Similarly, when the reference group response probability is large and $\beta_{11}$ is equal to 3.0, RE ranges from 33 to 56 percent in the first case and is between 32 and 33 percent in the second. Generally for three response groups, the results in the first case tend to be higher than those reported by Efron (1975) for two response groups, while those for the second case are lower.

Although logistic regression is always less efficient than normal discrimination, the former compares more favourably to the latter for log odds ratios close to zero, for observations equally distributed among the response groups and for explanatory variables related to more than one response. Further investigation is required of the effects of increasing the number of explanatory variables and of trends with the number of response categories. In particular, the discovery of the ameliorating influence of log odds ratios for a secondary category on relative efficiency needs to be confirmed in more than three response groups.

Figure 1. Relative Estimating Efficiency. Contours of constant RE are plotted for combinations of the log odds ratio parameters $\beta_1' = (\beta_{11}, 0)$ and $\beta_2' = (\beta_{12}, 0)$. The point specified by a $\beta_{11}$, $\beta_{12}$ combination corresponds to a value of RE between the two closest contour line values. (a) The population response probabilities are equal. (b) The reference population response probability is large.

Figure 2. Relative Estimating Efficiency. Contours of constant RE are plotted for combinations of the log odds ratio parameters $\beta_1' = (\beta_{11}, 0)$ and $\beta_2' = (0, \beta_{22})$. The point specified by a $\beta_{11}$, $\beta_{22}$ combination corresponds to a value of RE between the two closest contour line values. (a) The population response probabilities are equal. (b) The reference population response probability is large.

ESTIMATION UNDER THE CORRELATED
LOGISTIC MODEL

ABSTRACT 

Rosner (1983, 1984) proposed a model for correlated binary outcomes in the presence of covariates based on the beta-binomial distribution. Although calculation of maximum likelihood estimates under this model is costly, alternative estimators based on the usual logistic model, with or without dummy variables, are asymptotically biased (in general) and yield standard errors that are incorrect. An adjustment for the standard error of the usual logistic estimator is suggested for the case of one dichotomous independent variable. Also, a familiar conditional approach is shown to have low asymptotic relative efficiency.

1. INTRODUCTION 

Regression analysis provides a means of determining the effect of ex¬ 
planatory variables on a continuous outcome measure; for example, the ef¬ 
fect of age and sex on blood pressure. The usual method of estimation in 
regression analysis is ordinary least squares. The regression model may be 
refined to allow for correlation among the outcome measures; for example, 
when the blood pressure scores come from the members of the same family, 
the method of estimation becomes generalized least squares. 

Interest has been shown in the effects of the presence of intraclass correlation on ordinary least squares estimation. A large number of references has been given by Scott and Holt (1982). These authors have shown that the ordinary least squares estimator is unbiased but is not as efficient as the generalized least squares estimator, although the loss of efficiency is small. However, the estimate of the covariance matrix of the least squares estimator is also incorrect, and may underestimate or overestimate the true value, depending on both the usual intracluster correlation and on the sample intracluster correlation of the explanatory variables (also see Campbell, 1979; Donner, 1984).

Correlated binary outcomes occur in many family studies, for example, in 
the investigation of hypertension, in other cross-sectional studies, in studies 
with repeated measurements on an individual (e.g., measurements of physical 
ability at different levels of physical exertion), and in complex surveys (e.g., 
Rao and Scott, 1981). When the outcome measure is discrete, in particular 
binary, the usual model for assessing the effects of explanatory variables is 
the logistic regression model (Berkson, 1955; Cox, 1966; Day and Kerridge, 
1967). Until recently no extensions of this model have been available that 
would allow for the modelling of intracluster correlation. 

Williams (1982) generalized the beta-binomial model by introducing co¬ 
variates measured at the cluster level and requiring only a specific relation¬ 
ship between the mean and variance of the cluster-level response and the 
covariates. Brier (1980) generalized the beta-binomial model by allowing 
more than two levels of response, and, more importantly, covariates mea¬ 
sured at the unit (as opposed to the cluster) level; however, the covariates 
are only discrete (in the example in his paper, the covariate and the response 
variable are actually ordered but this is ignored in the methodology). 

Pierce and Sands (1975) introduced a correlated logistic-normal model 
which allowed the scores of units within a cluster to be correlated, but they 
only considered the case in which scores are measured on the cluster level. 
The generalization of this correlated logistic-normal model by Laird and 
Ware (1981) and Stiratelli et al. (1984), and the correlated logistic model 
by Rosner (1983, 1984) are models that allow both for a mixture of con¬ 
tinuous and discrete explanatory variables, and for correlation between the 
dependent binary variables. 

The correlated logistic model resembles more closely the usual logistic 
model than does its counterpart, the correlated logistic-normal model. Al¬ 
though somewhat underdeveloped at present, the correlated logistic model 
offers scope for expansion to multinomial outcomes and to more complex 
error structures. 

In this paper, the work of Scott and Holt is extended to discrete outcomes by using Rosner's correlated logistic model to represent the true probabilistic mechanism, and the first and second order properties of the estimators of "regression" parameters obtained under several incorrect but commonly used methods are examined.

When there are only two observations per cluster and the explanatory variable is itself discrete, the following results are obtained:

1. Several methods yield estimators that are asymptotically biased, except 
when the regression parameter is zero, that is, when there is no effect of 
the covariate. 

2. When the regression parameter is zero, the true variances of two com¬ 
monly used estimators are, in general, larger than that of the maxi¬ 
mum likelihood estimator, but under some circumstances one of these 
estimators has a smaller variance than that of the maximum likelihood 
estimator (without contradicting the usual theory of inference). 

3. The estimated variance of the usual logistic estimator is incorrect and, 
when the regression parameter is zero, may overestimate or underesti¬ 
mate the true value depending on the value of both the usual intracluster 
correlation and the intracluster correlation of the distribution of the in¬ 
dependent variables. 

4. In order to test the hypothesis that the parameter is zero, the usual 
logistic method may be used with a simple correction for the estimated 
variance of the estimator of the regression parameter. This generalizes 
a result of Brier (1980). 

5. A conditional estimator based on the Breslow-Day methodology for case- 
control studies is shown to have low efficiency. 

2. ROSNER’S MODEL 

Let $Y_1$ and $Y_2$ denote the outcome (or dependent) variables, and $x_1$
and $x_2$ the corresponding vectors of explanatory (or independent)
variables (or covariates). Further let $P(y_1, y_2 \mid x_1, x_2)$ denote the
conditional probability
$\Pr(Y_1 = y_1, Y_2 = y_2 \mid X_1 = x_1, X_2 = x_2)$. Rosner (1983, 1984)
proposed the model

$$P(0,0 \mid x_1,x_2) = 1/d,$$

$$P(1,0 \mid x_1,x_2) = \exp(\alpha_1 + \beta'x_1)/d,$$

$$P(0,1 \mid x_1,x_2) = \exp(\alpha_1 + \beta'x_2)/d,$$

and

$$P(1,1 \mid x_1,x_2) = \exp[\alpha_2 + \beta'(x_1 + x_2)]/d,$$

where $d$ is the normalizing constant, namely,

$$d = 1 + \exp(\alpha_1 + \beta'x_1) + \exp(\alpha_1 + \beta'x_2) + \exp[\alpha_2 + \beta'(x_1 + x_2)].$$
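
As a numerical illustration (a sketch that is not part of the original
paper), the four cell probabilities of the model can be evaluated directly.
The function name `rosner_probs` and the restriction to a scalar covariate
are assumptions made only for this example.

```python
import math

def rosner_probs(alpha1, alpha2, beta, x1, x2):
    """Joint probabilities P(y1, y2 | x1, x2) under the correlated logistic model.

    Scalar covariates are assumed here for simplicity.
    """
    w = {
        (0, 0): 1.0,
        (1, 0): math.exp(alpha1 + beta * x1),
        (0, 1): math.exp(alpha1 + beta * x2),
        (1, 1): math.exp(alpha2 + beta * (x1 + x2)),
    }
    d = sum(w.values())  # normalizing constant d
    return {y: v / d for y, v in w.items()}

# Illustrative parameter values (not taken from the paper)
probs = rosner_probs(alpha1=-1.0, alpha2=-1.5, beta=0.7, x1=1, x2=0)
total = sum(probs.values())

# With alpha2 = 2 * alpha1 the joint probability factorizes (independence).
ind = rosner_probs(alpha1=-1.0, alpha2=-2.0, beta=0.7, x1=1, x2=0)
p_y1 = ind[(1, 0)] + ind[(1, 1)]
p_y2 = ind[(0, 1)] + ind[(1, 1)]
```

In the second call the joint probability of (1,1) equals the product of the
two marginal probabilities, which is the uncorrelated logistic special case.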

When both $x_1$ and $x_2$ are null vectors, the distribution of the sum
$(Y_1 + Y_2)$ is that of a beta-binomial random variable with parameters 2,
$a$ and $b$, such that

$$\alpha_1 = \ln[a/(b + 1)]$$

and

$$\alpha_2 = \ln\{a(a + 1)/[b(b + 1)]\}.$$

In addition $Y_1$ and $Y_2$ have the correlation calculated by Crowder (1979),
namely,

$$\mathrm{corr}(Y_1, Y_2 \mid 0, 0) = 1/(a + b + 1).$$

However, this simple correlation does not exist for all values of $x_1$ and
$x_2$; in general, it may be shown that:

$$\mathrm{corr}(Y_1, Y_2 \mid x_1, x_2) =
\frac{\left|[\exp(\alpha_2) - \exp(2\alpha_1)]\,[\exp(\beta'\{x_1 + x_2\})]^{1/2}\right|}
{\{[\exp\alpha_1 + \exp(\alpha_2 + \beta'x_1)][\exp\alpha_1 + \exp(\alpha_2 + \beta'x_2)]
[1 + \exp(\alpha_1 + \beta'x_1)][1 + \exp(\alpha_1 + \beta'x_2)]\}^{1/2}}. \tag{1}$$
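
The correlation formula (1) can be checked numerically against a direct
computation from the four cell probabilities. The sketch below assumes a
scalar covariate; the function names and the values $a = 2$, $b = 3$,
$\beta = 0.7$ are chosen only for illustration.

```python
import math

def corr_closed_form(alpha1, alpha2, beta, x1, x2):
    """Equation (1) for scalar covariates."""
    num = abs((math.exp(alpha2) - math.exp(2 * alpha1))
              * math.exp(beta * (x1 + x2) / 2))
    den = math.sqrt(
        (math.exp(alpha1) + math.exp(alpha2 + beta * x1))
        * (math.exp(alpha1) + math.exp(alpha2 + beta * x2))
        * (1 + math.exp(alpha1 + beta * x1))
        * (1 + math.exp(alpha1 + beta * x2))
    )
    return num / den

def corr_direct(alpha1, alpha2, beta, x1, x2):
    """Correlation computed from the four cell probabilities."""
    w10 = math.exp(alpha1 + beta * x1)
    w01 = math.exp(alpha1 + beta * x2)
    w11 = math.exp(alpha2 + beta * (x1 + x2))
    d = 1 + w10 + w01 + w11
    p10, p01, p11 = w10 / d, w01 / d, w11 / d
    m1, m2 = p10 + p11, p01 + p11          # Pr(Y1 = 1), Pr(Y2 = 1)
    cov = p11 - m1 * m2
    return cov / math.sqrt(m1 * (1 - m1) * m2 * (1 - m2))

# Beta-binomial parametrization with illustrative a and b
a, b = 2.0, 3.0
alpha1 = math.log(a / (b + 1))
alpha2 = math.log(a * (a + 1) / (b * (b + 1)))

rho_null = corr_closed_form(alpha1, alpha2, 0.7, 0, 0)   # both covariates null
rho_10 = corr_closed_form(alpha1, alpha2, 0.7, 1, 0)     # covariates differ
```

With both covariates null the value reduces to $1/(a + b + 1)$; with
$(x_1, x_2) = (1, 0)$ the correlation takes a different value, which is the
dependence on the covariates that (1) describes.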

A more general way of writing Rosner’s model is in terms of the extended
exponential family of Dempster (1971); that is,

$$\ln P(y_1, y_2 \mid x_1, x_2) = \alpha_1(y_1 + y_2) + \beta'(x_1 y_1 + x_2 y_2)
+ (\alpha_2 - 2\alpha_1)\,y_1 y_2 - \ln d,$$

where $d$ is defined above, and

1. the condition $\alpha_2 > 2\alpha_1$ yields the beta-binomial when $x_1$
and $x_2$ are both null,

2. the condition $\alpha_2 = 2\alpha_1$ yields the usual (or uncorrelated)
logistic model.

Because of the complexity of the model, explicit solutions to the maxi¬ 
mum likelihood equations are usually not attainable, although in the case of 
a single binary covariate some simplification is possible. Alternative methods 
for estimation are provided by: 

1. the usual logistic model, that is, ignoring the effect of clustering, 

2. the usual logistic model with dummy variables to handle the effect of 
clustering, that is, with variables such that each takes on a value of 1 in 
only one cluster and 0 elsewhere, 

3. the conditional approach of Breslow and Day (1980), 

4. the usual regression model with binary dependent variable treated as a
continuous variable,

5. the beta-binomial model and its extensions as proposed by Williams
(1982),

6. the correlated logistic-normal model of Laird and Ware (1981), and 

7. the correlated probit model of Ochi and Prentice (1984). 

The usual regression model is known to be inappropriate in the estima¬ 
tion of the parameters of the usual logistic model, and hence will fare even 
worse in estimating parameters of the correlated logistic model. Williams’
models allow the explanatory variables to be only cluster-specific, that is, 
they take on the same value for every member of the same cluster. The 
correlated logistic-normal model and the correlated probit model are quite 
different from Rosner’s model, and no easy comparisons are possible. 

The remainder of the paper discusses parameter estimation using the 
first three alternatives listed above. In particular, we examine the (asymp¬ 
totic) bias and variance of the estimators of the coefficient of a single binary 
explanatory variable when all clusters contain exactly two units. 

3. ASYMPTOTIC BIAS 

Under the unconditional model, the maximum likelihood estimator of all 
parameters of the model is consistent (and hence asymptotically unbiased). 
This is also true for the estimator of the regression coefficient under the 
conditional model. However, it is not true, in general, for the two other 
estimators. 

3.1 Usual Logistic Model 

Consider a sample of $m$ independent clusters, each containing 2 units.
The usual logistic model ignores the intracluster correlation and provides an
explicit estimator of $\beta$, namely, the usual log odds ratio

$$\ln[n_{00}n_{11}/(n_{01}n_{10})],$$

where $n_{ij}$ is the observed number of units that have $x = i$ and $y = j$,
for $i = 0,1$ and $j = 0,1$, and may be considered as the observed value of a
random variable $N_{ij}$ which is the sum of $m$ independent random variables
$R_{ij}$, where $R_{ij}$ can assume values 0, 1 or 2, according to the
following probability distribution function:

$\Pr(R_{ij} = 0) = \Pr(X = i$ does not occur in the same unit as $Y = j)$,

$\Pr(R_{ij} = 1) = \Pr($one of $X_1$ and $X_2$ is $i$ and the corresponding
$Y$ has value $j)$ $+$ $\Pr($both $X_1$ and $X_2$ are $i$ and only one of
$Y_1$ and $Y_2$ is $j)$,

$\Pr(R_{ij} = 2) = \Pr($both $X_1$ and $X_2$ are $i$ and both $Y_1$ and $Y_2$
are $j)$.

Since each random variable $N_{ij}$ is the sum of $m$ independent $R_{ij}$’s,

$$E(N_{ij}/m) = E(R_{ij}),$$

so that as $m \to \infty$,

$$\frac{n_{00}n_{11}}{n_{01}n_{10}} \xrightarrow{p}
\frac{E(R_{00})E(R_{11})}{E(R_{01})E(R_{10})}. \tag{2}$$

It may be shown (Koval and Donner, 1984) that each expectation in (2),
that is, $E(R_{ij})$, $i = 0,1$, $j = 0,1$, may be written as a simple
function of the marginal distribution of $X$ and the conditional distribution
of $Y$. Hence the right-hand side of (2) becomes:

$$\frac{p_{00}[P(0,0 \mid 0,0) + P(1,0 \mid 0,0)] + p_{10}[P(0,0 \mid 1,0) + P(1,0 \mid 1,0)]}
{p_{00}[P(0,1 \mid 0,0) + P(1,1 \mid 0,0)] + p_{10}[P(0,1 \mid 1,0) + P(1,1 \mid 1,0)]}$$

$$\times\;
\frac{p_{11}[P(1,0 \mid 1,1) + P(1,1 \mid 1,1)] + p_{10}[P(1,0 \mid 1,0) + P(1,1 \mid 1,0)]}
{p_{11}[P(0,0 \mid 1,1) + P(1,0 \mid 1,1)] + p_{10}[P(0,1 \mid 1,0) + P(0,0 \mid 1,0)]},$$

which, when we assume Rosner’s correlated logistic model for the conditional
distribution of $Y$, becomes

$$e^{\beta}\;
\frac{p_{00}d_{10}(1 + e^{\alpha_1}) + p_{10}d_{00}(1 + e^{\alpha_1+\beta})}
{p_{10}d_{11}(1 + e^{\alpha_1}) + p_{11}d_{10}(1 + e^{\alpha_1+\beta})}
\times
\frac{p_{11}d_{10}(e^{\alpha_1} + e^{\alpha_2+\beta}) + p_{10}d_{11}(e^{\alpha_1} + e^{\alpha_2})}
{p_{10}d_{00}(e^{\alpha_1} + e^{\alpha_2+\beta}) + p_{00}d_{10}(e^{\alpha_1} + e^{\alpha_2})},
\tag{3}$$

where

$$d_{00} = 1 + 2e^{\alpha_1} + e^{\alpha_2},$$

$$d_{01} = d_{10} = 1 + e^{\alpha_1} + e^{\alpha_1+\beta} + e^{\alpha_2+\beta},$$

and

$$d_{11} = 1 + 2e^{\alpha_1+\beta} + e^{\alpha_2+2\beta}.$$

The expression (3) is equal to $e^{\beta}$ when

1. $\beta = 0$, that is, when there is no effect of covariate on the dependent
variable, or

2. $\alpha_2 = 2\alpha_1$, that is, when there is independence of units within
the cluster.

Hence the usual logistic estimator of $\beta$ is consistent if one of the
above conditions holds; however, these are not necessary conditions.
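
The probability limit on the right-hand side of (2) can also be obtained by
direct enumeration over the joint distribution of $(X_1, X_2)$ and the four
outcomes, which gives a simple numerical check of the two sufficient
conditions. The following is a sketch only: the function names, the joint
distribution `p_x` and the parameter values are assumptions made for
illustration, with a single binary covariate.

```python
import math
from itertools import product

def cell_probs(alpha1, alpha2, beta, x1, x2):
    """P(y1, y2 | x1, x2) under Rosner's model with a scalar covariate."""
    w = {(0, 0): 1.0,
         (1, 0): math.exp(alpha1 + beta * x1),
         (0, 1): math.exp(alpha1 + beta * x2),
         (1, 1): math.exp(alpha2 + beta * (x1 + x2))}
    d = sum(w.values())
    return {y: v / d for y, v in w.items()}

def limit_odds_ratio(alpha1, alpha2, beta, p_x):
    """Right-hand side of (2): E(R00) E(R11) / [E(R01) E(R10)].

    p_x maps (x1, x2) to Pr(X1 = x1, X2 = x2); R_ij counts units with
    x = i and y = j in one cluster.
    """
    e_r = {(i, j): 0.0 for i, j in product((0, 1), repeat=2)}
    for (x1, x2), px in p_x.items():
        for (y1, y2), py in cell_probs(alpha1, alpha2, beta, x1, x2).items():
            # each of the two units adds one to the count of its (x, y) cell
            e_r[(x1, y1)] += px * py
            e_r[(x2, y2)] += px * py
    return e_r[(0, 0)] * e_r[(1, 1)] / (e_r[(0, 1)] * e_r[(1, 0)])

# Hypothetical joint distribution of (X1, X2), positively correlated
p_x = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

lim_null = limit_odds_ratio(-1.0, -1.2, 0.0, p_x)     # beta = 0
lim_indep = limit_odds_ratio(-1.0, -2.0, 0.8, p_x)    # alpha2 = 2 * alpha1
lim_general = limit_odds_ratio(-1.0, -1.2, 0.8, p_x)  # neither condition holds
```

In the first two cases the limit equals $e^{\beta}$; in the third case it
does not, so the usual estimator is asymptotically biased there.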

In order to investigate the size of the asymptotic bias of the usual logistic
estimator of $\beta$, and to determine other conditions under which this
estimator is consistent, we introduce a symmetric correlated binary
distribution for $X$. Assume that the marginal distributions of $X_1$ and
$X_2$ are identical to write:

$$p_1 = \Pr(X_i = 1), \quad i = 1,2,$$

$$p_0 = \Pr(X_i = 0) = 1 - p_1, \quad i = 1,2,$$

$$p_{jk} = \Pr(X_1 = j, X_2 = k), \quad j,k = 0,1.$$

The joint probabilities may then be written in terms of $\rho_x$ in the
following way:

$$p_{11} = p_1(p_1 + \rho_x p_0),$$

$$p_{00} = p_0(p_0 + \rho_x p_1),$$

$$p_{01} = p_{10} = p_0 p_1(1 - \rho_x).$$

This binary distribution handles several special cases. Using this joint
distribution of $X$ we can write the second part of (3), known as the
inflation factor, as

$$\frac{d_{10}(p_0 + \rho_x p_1)(1 + e^{\alpha_1}) + d_{00}\,p_1(1 - \rho_x)(1 + e^{\alpha_1+\beta})}
{d_{11}\,p_0(1 - \rho_x)(1 + e^{\alpha_1}) + d_{10}(p_1 + \rho_x p_0)(1 + e^{\alpha_1+\beta})}
\times
\frac{d_{10}(p_1 + \rho_x p_0)(e^{\alpha_1} + e^{\alpha_2+\beta}) + d_{11}\,p_0(1 - \rho_x)(e^{\alpha_1} + e^{\alpha_2})}
{d_{00}\,p_1(1 - \rho_x)(e^{\alpha_1} + e^{\alpha_2+\beta}) + d_{10}(p_0 + \rho_x p_1)(e^{\alpha_1} + e^{\alpha_2})}.$$

Table 1 gives values of this inflation factor for various values of $\rho_x$,
$\rho_{y|x}$ and $\beta$, where $\rho_{y|x}$ is defined as $(a + b + 1)^{-1}$.
When $a, b \to \infty$, such that $a/b \to$ a constant, then
$\rho_{y|x} \to 0$. In particular, we assume that $a = b$ so that
$p_0 = p_1 = 1/2$. From Table 1 it may be seen that

1. The estimator $\ln[(n_{00}n_{11})/(n_{01}n_{10})]$ is consistent for
$\beta$ only if

(a) $\beta = 0$,

(b) $\rho_{y|x} = 0$ (independence), or

(c) a very special combination of values of $\rho_x$, $\rho_{y|x}$ and
$\beta$ exists.

2. This estimator is asymptotically negatively biased when $\rho_x$ is
negative or zero or has small positive values.

3. This estimator is asymptotically positively biased for positive values of
$\rho_x$, in particular, for values greater than 0.10.

Table 1. Inflation Factor for the Usual Estimator of the Odds Ratio

                        Values of $\beta$ (followed by values of $\exp(\beta)$)

 $\rho_x$   $\rho_{y|x}$     0.000     0.693     1.609     2.303
                             (1.0)     (2.0)     (5.0)    (10.0)

  -1.00       0.050          1.000     0.967     0.935     0.921
              0.200          1.000     0.875     0.765     0.719
              0.500          1.000     0.714     0.500     0.419
              0.800          1.000     0.579     0.304     0.209

  -0.50       0.050          1.000     0.983     0.967     0.959
              0.200          1.000     0.935     0.871     0.842
              0.500          1.000     0.843     0.692     0.625
              0.800          1.000     0.756     0.529     0.432

   0.00       0.050          1.000     1.000     1.000     0.999
              0.200          1.000     0.999     0.994     0.990
              0.500          1.000     0.995     0.966     0.941
              0.800          1.000     0.989     0.918     0.860

   0.10       0.050          1.000     1.003     1.006     1.008
              0.200          1.000     1.013     1.021     1.023
              0.500          1.000     1.029     1.035     1.024
              0.800          1.000     1.044     1.028     0.990

   0.50       0.050          1.000     1.017     1.034     1.041
              0.200          1.000     1.068     1.138     1.170
              0.500          1.000     1.179     1.371     1.458
              0.800          1.000     1.301     1.653     1.816

   1.00       0.050          1.000     1.034     1.069     1.085
              0.200          1.000     1.143     1.307     1.391
              0.500          1.000     1.400     1.998     2.382
              0.800          1.000     1.726     3.281     4.778
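
The inflation factor can be evaluated directly from its closed form. The
sketch below assumes, as in the text, $a = b$ (so that
$\rho_{y|x} = 1/(2a + 1)$) and $p_0 = p_1 = 1/2$; the function name
`inflation_factor` is hypothetical. For the parameter combinations evaluated
here it gives 0.714, 0.995, 1.400 and 0.875, which agree with the
corresponding entries of Table 1.

```python
import math

def inflation_factor(rho_x, rho_y, beta, p1=0.5):
    """Inflation factor of the usual odds-ratio estimator (second part of (3)).

    Assumes a = b in the beta-binomial parametrization, so that
    rho_y = 1 / (2a + 1), and the symmetric binary distribution for X.
    """
    a = (1.0 / rho_y - 1.0) / 2.0
    e1 = a / (a + 1.0)          # exp(alpha1) = a / (b + 1) with b = a
    e2 = 1.0                    # exp(alpha2) = a(a+1) / [b(b+1)] with b = a
    eb = math.exp(beta)
    p0 = 1.0 - p1
    d00 = 1 + 2 * e1 + e2
    d10 = 1 + e1 + e1 * eb + e2 * eb
    d11 = 1 + 2 * e1 * eb + e2 * eb ** 2
    w00 = p0 + rho_x * p1       # p00 / p0
    w11 = p1 + rho_x * p0       # p11 / p1
    first = ((d10 * w00 * (1 + e1) + d00 * p1 * (1 - rho_x) * (1 + e1 * eb))
             / (d11 * p0 * (1 - rho_x) * (1 + e1) + d10 * w11 * (1 + e1 * eb)))
    second = ((d10 * w11 * (e1 + e2 * eb) + d11 * p0 * (1 - rho_x) * (e1 + e2))
              / (d00 * p1 * (1 - rho_x) * (e1 + e2 * eb) + d10 * w00 * (e1 + e2)))
    return first * second

# A few entries with exp(beta) = 2
entry_neg = inflation_factor(-1.0, 0.5, math.log(2))
entry_zero = inflation_factor(0.0, 0.5, math.log(2))
entry_pos = inflation_factor(1.0, 0.5, math.log(2))
entry_low = inflation_factor(-1.0, 0.2, math.log(2))
```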

3.2 Dummy Variables Model 

The logistic model for a single observation may be written as:

$$f(y \mid x) = \exp[y(\gamma + \beta'x)]/[1 + \exp(\gamma + \beta'x)].$$

Hence the log likelihood is

$$l(\gamma, \beta) = y(\gamma + \beta'x) - \ln[1 + \exp(\gamma + \beta'x)].$$

In the dummy variables approach the logistic model is written with a
different $\gamma$ parameter for each cluster, so that the log likelihood for
the $j$th unit in the $i$th cluster is

$$l_{ij}(\gamma_i, \beta) = y_{ij}(\gamma_i + \beta'x_{ij}) - \ln[1 + \exp(\gamma_i + \beta'x_{ij})].$$

The maximum likelihood equations for this model consist of two sets of
equations:

$$\sum_{j=1}^{n_i} y_{ij} = \sum_{j=1}^{n_i} f_{ij}, \quad i = 1, \ldots, m, \tag{4}$$

where

$$f_{ij} = \exp(\gamma_i + \beta'x_{ij})/[1 + \exp(\gamma_i + \beta'x_{ij})],$$

and

$$\sum_{i=1}^{m}\sum_{j=1}^{n_i} y_{ij}x_{ij} = \sum_{i=1}^{m}\sum_{j=1}^{n_i} f_{ij}x_{ij}. \tag{5}$$

Consider those clusters with all $Y$ values the same. For those clusters
with all $Y$’s equal to 1, equation (4) becomes

$$n_i = \sum_{j=1}^{n_i} f_{ij},$$

which implies that $f_{ij}$ is 1 for every member of cluster $i$ and hence
that

$$\gamma_i + \beta'x_{ij} \to \infty.$$

Since this holds for all values of $x_{ij}$, and we assume the values of
$\beta_1, \beta_2, \ldots, \beta_p$ are finite, it follows that

$$\gamma_i \to \infty. \tag{6}$$

Similarly when all $Y$’s in a cluster are 0, then

$$\gamma_i \to -\infty. \tag{7}$$

Hence if all the $Y$’s in a cluster are the same, we cannot use that cluster
for the estimation of $\beta$. We say that the cluster contains no
information about the parameter $\beta$, and it may be discarded from the
estimation procedure. In particular, in any numerical minimization, when
either (6) or (7) or both occur, the solution to the minimization problem is
considered unbounded, and no numerical value is returned for any of the
parameter estimates.

In the rest of this section we consider there to be only a single binary
covariate, $X$, which takes values of 0 and 1, and only two units per cluster.

3.2.1 Cluster-Specific Covariates. In the case of a cluster-specific
covariate, $X_i$ has the same value $x_i$ for every unit in the cluster;
hence $f_{ij}$ is the same for every unit in the cluster. Let us denote the
cluster-specific $f_{ij}$ by $f_i$. From the preceding section we need only
consider those clusters with exactly one $Y$ equal to 0 and the other $Y$
equal to 1. For these clusters, equation (4) implies that $f_i = 1/2$, so
that $x_i = 0 \Rightarrow \gamma_i = 0$, but there is no information about
$\beta$. Similarly, $x_i = 1 \Rightarrow \gamma_i + \beta = 0$, that is,
$\gamma_i = -\beta$. However, substitution into the second set of the maximum
likelihood equations (5) does not produce a solution for $\beta$. Hence
cluster-specific covariates will not yield a maximum likelihood estimator of
$\beta$. Moreover, if the covariates are unit-specific, but in a specific
cluster the observed $x$’s are the same for all units, that cluster provides
no information for the estimation of $\beta$ and may be discarded from the
estimation procedure.

3.2.2 Unit-Specific Covariates. By the arguments given in the preceding
two sections we need only consider the clusters with dissimilar $X$ values.
Let us say that for cluster $i$ we have $x_{i1} = 1$ and $x_{i2} = 0$.
Equation (4) evaluated for this cluster yields

$$1 = f_{i1} + f_{i2},$$

where $f_{i1}$ and $f_{i2}$ are the estimated probabilities, so that the
equation may be written as

$$1 = \frac{\exp(\gamma_i + \beta)}{1 + \exp(\gamma_i + \beta)} + \frac{\exp(\gamma_i)}{1 + \exp(\gamma_i)},$$

which becomes $1 = \exp(2\gamma_i + \beta)$. This may be rewritten as

$$\gamma_i = -\beta/2, \tag{8}$$

so that $f_{i1} = \exp(\beta/2)/[1 + \exp(\beta/2)] = f$, say, for all these
clusters, and $f_{i2} = 1 - f$. For the other clusters, that is, those with
$x_{i1} = 0$ and $x_{i2} = 1$, we get $f_{i1} = 1 - f$ and $f_{i2} = f$.

Hence equation (5) evaluated over the clusters with both dissimilar $x$’s
and dissimilar $y$’s yields

$$r_{11} = r_1 f, \tag{9}$$

where $r_1$ is the total number of clusters with dissimilar $x$’s and
dissimilar $y$’s and $r_{11}$ is the number of these clusters within which
each unit has the same $x$ and $y$ value. Solving (9) for $f$ and then
solving (8) for $\beta$, we get

$$\exp(\beta) = (r_{11}/r_{10})^2,$$

where $r_{10}$ is the number of clusters in which each unit has an $x$ value
different from its $y$ value.

In a manner similar to that used for the usual logistic model in Section
3.1, it may be shown that, conditional upon the total
$r_1$ $(= r_{11} + r_{10})$,

$$\frac{r_{11}}{r_{10}} \xrightarrow{p}
\frac{\Pr(Y_1 = 1, Y_2 = 0 \mid X_1 = 1, X_2 = 0)}{\Pr(Y_1 = 0, Y_2 = 1 \mid X_1 = 1, X_2 = 0)},$$

which, under Rosner’s model, is

$$\frac{\exp(\alpha_1 + \beta)}{\exp(\alpha_1)} = \exp(\beta).$$

Hence the dummy variables approach produces an estimator of $\exp(\beta)$
with an inflation factor equal to the value being estimated.


3.3 The Conditional Model 

Breslow et al. (1978) proposed a conditional logistic model for use in
the analysis of case-control studies. In particular, for each cluster, they
condition upon the total number of occurrences, that is, they condition upon
all possible outcomes with the same number of $Y$’s equal to 1 and 0. If
we consider the case of only two units per cluster and condition upon the
existence of only (0,1) or (1,0) outcomes for $(Y_1, Y_2)$, we get

$$\Pr(1,0 \mid x_1,x_2) = \exp(\beta'x_1)/d,$$

$$\Pr(0,1 \mid x_1,x_2) = \exp(\beta'x_2)/d,$$

where

$$d = \exp(\beta'x_1) + \exp(\beta'x_2).$$

This is the same function as obtained by Breslow and Day (1980, p. 248).
According to Breslow and Day (p. 250), this conditional logistic model, in
the case of a single binary explanatory variable, has the maximum likelihood
estimator $\ln(r_{11}/r_{10})$.
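
The relationship between the dummy variables estimator and the conditional
estimator can be illustrated numerically. The sketch below uses hypothetical
function names and illustrative values; it checks that
$\gamma_i = -\beta/2$ solves equation (4) for a cluster with dissimilar
$x$’s and $y$’s, and that the dummy variables estimate converges to
$2\beta$ while the conditional estimate converges to $\beta$.

```python
import math

def dummy_variable_estimate(r11, r10):
    """Estimate of beta from the dummy-variables fit: 2 * ln(r11 / r10)."""
    return 2.0 * math.log(r11 / r10)

def conditional_estimate(r11, r10):
    """Conditional (matched-pair) estimate of beta: ln(r11 / r10)."""
    return math.log(r11 / r10)

# gamma_i = -beta / 2 solves 1 = f_i1 + f_i2 when x_i1 = 1 and x_i2 = 0
beta = 0.9
gamma = -beta / 2
f1 = math.exp(gamma + beta) / (1 + math.exp(gamma + beta))  # unit with x = 1
f2 = math.exp(gamma) / (1 + math.exp(gamma))                # unit with x = 0

# Probability limit of r11 / r10 under Rosner's model for (x1, x2) = (1, 0)
alpha1 = -0.5
limit_ratio = math.exp(alpha1 + beta) / math.exp(alpha1)
limit_conditional = math.log(limit_ratio)     # equals beta
limit_dummy = 2.0 * math.log(limit_ratio)     # equals 2 * beta

# Hypothetical counts of discordant clusters
est_dummy = dummy_variable_estimate(30, 10)
est_cond = conditional_estimate(30, 10)
```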


4. ASYMPTOTIC VARIANCE 

Although the asymptotic variance can be calculated for all parameter
values, the initial case of interest is that with a single binary covariate
where the true value of $\beta$ is 0, that is, when the covariate has no
effect on the dependent variable. In this case, the correlation between $Y_1$
and $Y_2$ is the same regardless of the value of $X_1$ and $X_2$, and the
following four estimators are consistent:

1. the maximum likelihood estimator under Rosner’s (unconditional)
model, $\hat\beta_r$,

2. the estimator under the usual logistic model, $\hat\beta_u$,

3. the estimator under the usual logistic model with dummy variables,
$\hat\beta_d$, and

4. the maximum likelihood estimator under the conditional (Breslow-Day)
logistic model, $\hat\beta_c$.

We now compute the asymptotic variance of these estimators.

4.1 Unconditional Maximum Likelihood Estimator 

For a single observation, $(y_1, y_2)$, with covariate values $(x_1, x_2)$,
the log likelihood function may be written as:

$$l(\alpha_1, \alpha_2, \beta) = \alpha_1(y_1 + y_2) + \beta(x_1 y_1 + x_2 y_2)
+ (\alpha_2 - 2\alpha_1)\,y_1 y_2 - \ln d,$$

where

$$d = 1 + \exp(\alpha_1 + \beta x_1) + \exp(\alpha_1 + \beta x_2) + \exp(\alpha_2 + \beta\{x_1 + x_2\}).$$

By defining the following probabilities:

$$f_{jk} = \Pr(y_1 = j, y_2 = k \mid X = x),$$

$$f_{1+} = f_{10} + f_{11},$$

$$f_{+1} = f_{01} + f_{11},$$

one can calculate the second derivatives of the likelihood function with
respect to $\alpha_1$, $\alpha_2$, and $\beta$ to obtain the upper triangular
part of the matrix containing second derivatives as:

$$\begin{pmatrix}
(f_{10} + f_{01})(1 - f_{10} - f_{01}) & -f_{11}(f_{10} + f_{01}) &
x_1(f_{10}f_{00} - f_{01}f_{11}) + x_2(f_{01}f_{00} - f_{10}f_{11}) \\
 & f_{11}(1 - f_{11}) & f_{11}(x_1 f_{0+} + x_2 f_{+0}) \\
 & & x_1^2 f_{1+}f_{0+} + x_2^2 f_{+1}f_{+0} + 2x_1x_2(f_{11} - f_{1+}f_{+1})
\end{pmatrix}.$$

In the case when $\beta$ is 0, taking expectation of the above matrix with
respect to the distribution of $X$, and writing $f_{00}$ as $h_0$, $f_{11}$
as $h_2$, and $(f_{01} + f_{10})$ as $h_1$, where

$$h_i = \Pr(\text{exactly } i\ y\text{'s are 1}),$$

we obtain the following matrix:

$$\begin{pmatrix}
h_1(1 - h_1) & -h_1h_2 & \mu_x h_1(h_0 - h_2) \\
-h_1h_2 & h_2(1 - h_2) & 2\mu_x h_2 f_0 \\
\mu_x h_1(h_0 - h_2) & 2\mu_x h_2 f_0 & 2E(X^2)f_1f_0 + 2E(X_1X_2)[h_2 - f_1^2]
\end{pmatrix}.$$

We call this matrix $I$, denoting the Fisher information matrix. Let
$I_{ij}$ be the $(i,j)$th element of $I$, and $I^{ij}$ the corresponding
element of $I^{-1}$. We wish to obtain $I^{33}$ which, when divided by $m$,
is an expression for the asymptotic variance of the maximum likelihood
estimator under Rosner’s correlated logistic model. Adapting the partitioning
method while inverting the matrix $I$, the reciprocal of $I^{33}$ may be
obtained as:

$$(I^{33})^{-1} = 2\sigma_x^2[f_0f_1 + \rho_x(h_2 - f_1^2)].$$

However,

$$\rho_{y|x} = (h_2 - f_1^2)/(f_0f_1)$$

and hence

$$(I^{33})^{-1} = 2\sigma_x^2 f_0f_1(1 + \rho_{y|x}\rho_x).$$

This expression may be described as the information about $\beta$ in a single
cluster (when $\beta$ is 0), and its inverse divided by $m$ is the variance
of $\hat\beta_r$ in a sample of $m$ clusters, so that:

$$\mathrm{Var}(\hat\beta_r) = \{2m\sigma_x^2 f_0f_1(1 + \rho_x\rho_{y|x})\}^{-1}.$$
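
The partitioned inversion can be checked numerically. The sketch below (an
illustration under stated assumptions, not the authors' computation) builds
the expected information matrix at $\beta = 0$ for a binary covariate with
the symmetric distribution of Section 3.1, forms the Schur complement for
$\beta$, and compares it with the closed form
$2\sigma_x^2 f_0f_1(1 + \rho_x\rho_{y|x})$. The function names and the
numerical values are hypothetical.

```python
def info_beta_rosner(h0, h1, h2, p1, rho_x):
    """Information about beta in one cluster at beta = 0 (Schur complement)."""
    f1 = h1 / 2 + h2                        # Pr(Y = 1)
    f0 = 1 - f1
    mu = p1                                 # E(X) for a binary covariate
    ex2 = p1                                # E(X^2) = E(X) for a binary covariate
    ex1x2 = p1 * (p1 + rho_x * (1 - p1))    # E(X1 X2) = p11
    # 2x2 block for (alpha1, alpha2), cross terms with beta, and beta entry
    a11, a12, a22 = h1 * (1 - h1), -h1 * h2, h2 * (1 - h2)
    b1, b2 = mu * h1 * (h0 - h2), 2 * mu * h2 * f0
    c = 2 * ex2 * f1 * f0 + 2 * ex1x2 * (h2 - f1 ** 2)
    # c - b' A^{-1} b is the reciprocal of the (3,3) element of the inverse
    det = a11 * a22 - a12 ** 2
    quad = (a22 * b1 ** 2 - 2 * a12 * b1 * b2 + a11 * b2 ** 2) / det
    return c - quad

def info_beta_closed_form(h0, h1, h2, p1, rho_x):
    """Closed form 2 * sigma_x^2 * f0 * f1 * (1 + rho_x * rho_y)."""
    f1 = h1 / 2 + h2
    f0 = 1 - f1
    rho_y = (h2 - f1 ** 2) / (f0 * f1)
    var_x = p1 * (1 - p1)
    return 2 * var_x * f0 * f1 * (1 + rho_x * rho_y)

# Illustrative values: (h0, h1, h2) sum to 1
args_a = (0.5, 0.3, 0.2, 0.3, 0.4)
args_b = (0.3, 0.4, 0.3, 0.5, -1.0)
schur_a, closed_a = info_beta_rosner(*args_a), info_beta_closed_form(*args_a)
schur_b, closed_b = info_beta_rosner(*args_b), info_beta_closed_form(*args_b)
```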
4.2 Usual Logistic Estimator

If an incorrect model, namely the usual (or uncorrelated) logistic model,
is used, estimates of $\beta$ and $\alpha$ are obtained. When the true value
of $\beta$ is 0, this estimator of $\beta$, denoted by $\hat\beta_u$, is
consistent, but the estimator of $\alpha$, denoted by $\hat\alpha_u$, is, in
general, not consistent. Regardless, we can calculate the asymptotic variance
of $\hat\beta_u$ under Rosner’s model, and compare this with the variance of
the true maximum likelihood estimator, $\hat\beta_r$. The methodology in this
section was suggested by Kulperger (1979). An explanation is given in
Appendix 1. Let $\alpha_0$ and $\beta_0$ denote the values of the parameters
which the estimators, $\hat\alpha_u$ and $\hat\beta_u$, approach
asymptotically. When there is a single binary covariate, a closed form
expression can be obtained for $\hat\alpha_u$, namely,

$$\hat\alpha_u = \ln(n_{10}/n_{00}),$$

where $n_{i0}$ is the number of $Y$’s equal to $i$ when the corresponding $X$
is 0. Using the methodology of Section 3.1, it can be shown that

$$\exp(\hat\alpha_u) = (n_{10}/n_{00}) \xrightarrow{p} E(R_{10})/E(R_{00}),$$

where the notation is as defined in Section 3.1. Assuming Rosner’s model to
be correct, we can write this expression as

$$\frac{d_{10}p_{00}(e^{\alpha_1} + e^{\alpha_2}) + d_{00}p_{10}(e^{\alpha_1} + e^{\alpha_2+\beta})}
{d_{10}p_{00}(1 + e^{\alpha_1}) + d_{00}p_{10}(1 + e^{\alpha_1+\beta})}. \tag{11}$$

When $\beta$ is 0, $d_{00}$ is equal to $d_{10}$, and (11) becomes

$$\frac{e^{\alpha_1} + e^{\alpha_2}}{1 + e^{\alpha_1}}; \tag{12}$$

hence $\alpha_0$ is equal to the logarithm of this expression.

The equations for obtaining the maximum likelihood estimators, also
known as the estimating equations, are the following:

$$\sum_{i=1}^{m}(y_{i1} + y_{i2} - f_{i1} - f_{i2}) = 0,$$

$$\sum_{i=1}^{m}[x_{i1}(y_{i1} - f_{i1}) + x_{i2}(y_{i2} - f_{i2})] = 0.$$

Let us denote the left-hand side of these equations by $S_m(\alpha, \beta)$.
We require the expected value of $S_m(\alpha_0, \beta_0)$ with respect to
Rosner’s model, that is, $E(S_m(\alpha_0, \beta_0))$. Since the clusters are
independent, $E(S_m(\alpha_0, \beta_0))$ can be written as the vector

$$D = \begin{pmatrix}
mE(Y_1 + Y_2 - F_1 - F_2) \\
mE[X_1(Y_1 - F_1) + X_2(Y_2 - F_2)]
\end{pmatrix}.$$
When $\beta$ is 0, under Rosner’s model

$$E(Y_i) = \frac{e^{\alpha_1} + e^{\alpha_2}}{1 + 2e^{\alpha_1} + e^{\alpha_2}}.$$

Each value of $F_i$, namely $f_i$, is of the form

$$\exp(\alpha + \beta x)/[1 + \exp(\alpha + \beta x)].$$

However, at the value $(\alpha_0, \beta_0 = 0)$, this expression is

$$e^{\alpha_0}/(1 + e^{\alpha_0}),$$

which is a constant. Substituting from (12), we have

$$E(F_i) = \frac{e^{\alpha_1} + e^{\alpha_2}}{1 + 2e^{\alpha_1} + e^{\alpha_2}} = E(Y_i),$$

so that $D_1$ is 0. For $D_2$ we evaluate $E[X(Y - F)]$ by taking expectation
with respect to the conditional distribution of $Y$ given $X$ and then with
respect to the marginal distribution of $X$, so that

$$E[X(Y - F)] = E_X[X E_{Y|X}(Y - F)],$$

but it was shown above that $E_{Y|X}(Y - F)$ is 0 so that
$E[X(Y - F)] = 0$. Thus $S_m(\alpha_0, \beta_0)$ has expectation 0. Moreover
it is a sum of $m$ independent and identically distributed terms, each of
which has mean 0 and variance $V$ (to be calculated). Hence by the
multivariate central limit theorem,

$$\frac{1}{\sqrt{m}}\,S_m(\alpha_0, \beta_0) \xrightarrow{d} N_2(0, V),$$

where $N_2$ is the bivariate normal distribution.

The calculation of the elements of $V$ is fairly straightforward. We obtain

$$v_{11} = \mathrm{Var}(Y_1 + Y_2 - F_1 - F_2) = 2f_0f_1(1 + \rho_{y|x}),$$




J. KOVAL AND A. DONNER 


by the definition of $\rho_{y|x}$. To evaluate the other entries in $V$ we
use the rule

$$\mathrm{Var}_{XY} = E_X \mathrm{Var}_{Y|X} + \mathrm{Var}_X E_{Y|X}.$$

In this case $E_{Y|X}$ of both elements of $S_m(\alpha_0, \beta_0)$ is 0;
moreover, the distribution of $Y$ is not influenced by $x$ (when $\beta_0$ is
0), so that $\mathrm{Var}_{Y|X}$ is actually $\mathrm{Var}_Y$; hence

$$v_{22} = E_X \mathrm{Var}_Y[X_1(Y_1 - F_1) + X_2(Y_2 - F_2)]
= 2f_0f_1[E(X^2) + \rho_{y|x}E(X_1X_2)]$$

and finally

$$v_{12} = 2f_0f_1\mu_x(1 + \rho_{y|x}).$$

Now we consider the asymptotic behaviour of the first derivative of
$S_m(\alpha, \beta)/m$ as $m$ approaches infinity. Let $\theta$ denote the
vector of parameters, namely $\alpha$ and $\beta$, $\hat\theta$ denote the
value of $\theta$ at $\hat\alpha_u$ and $\hat\beta_u$, and $\theta_0$ denote
the value at $\alpha_0$ and $\beta_0$. Then,

$$-\frac{1}{m}\frac{\partial S_m(\theta)}{\partial\theta}$$

is a matrix $U_m$ of the following form:

$$U_m = \frac{1}{m}\sum_{i=1}^{m}
\begin{pmatrix}
f_{i1}g_{i1} + f_{i2}g_{i2} & x_{i1}f_{i1}g_{i1} + x_{i2}f_{i2}g_{i2} \\
x_{i1}f_{i1}g_{i1} + x_{i2}f_{i2}g_{i2} & x_{i1}^2f_{i1}g_{i1} + x_{i2}^2f_{i2}g_{i2}
\end{pmatrix},$$

where

$$g_{ij} = 1 - f_{ij}, \quad i = 1, \ldots, m, \quad j = 1,2.$$

It follows that as $m \to \infty$, $f_{ij}g_{ij} \to f_1f_0$, and

$$-\frac{1}{m}\frac{\partial S_m(\theta)}{\partial\theta} \to
2f_0f_1\begin{pmatrix} 1 & \mu_x \\ \mu_x & E(X^2) \end{pmatrix}.$$

Let us denote the latter matrix by $U$. As stated in Appendix 1, the
asymptotic distribution of

$$\sqrt{m}\,(\hat\theta - \theta_0)$$

is $N_2(0, U^{-1}VU^{-1})$. For the asymptotic variance of $\hat\beta_u$, we
need only evaluate the entry in position (2,2) of the matrix
$U^{-1}VU^{-1}$, which can be shown to be

$$(1 + \rho_{y|x}\rho_x)/(2\sigma_x^2 f_0f_1).$$

This is the contribution to the variance of $\hat\beta_u$ by a single
cluster. The variance of $\hat\beta_u$ in a sample of $m$ clusters is

$$(1 + \rho_{y|x}\rho_x)/(2m\sigma_x^2 f_0f_1).$$
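
The (2,2) entry of $U^{-1}VU^{-1}$ can be verified numerically. The sketch
below assumes a binary covariate with the symmetric distribution of Section
3.1, takes $U$ as $2f_0f_1$ times the matrix of moments of $X$ (the sign of
$U$ does not affect the sandwich product), and uses hypothetical function
names and parameter values.

```python
def sandwich_var_beta(f1, rho_y, p1, rho_x):
    """(2,2) entry of U^{-1} V U^{-1} for the usual logistic estimator at beta = 0."""
    f0 = 1 - f1
    mu, ex2 = p1, p1                          # binary X: E(X) = E(X^2) = p1
    ex1x2 = p1 * (p1 + rho_x * (1 - p1))      # E(X1 X2) = p11
    c = 2 * f0 * f1
    u = [[c, c * mu], [c * mu, c * ex2]]
    v = [[c * (1 + rho_y), c * mu * (1 + rho_y)],
         [c * mu * (1 + rho_y), c * (ex2 + rho_y * ex1x2)]]
    det_u = u[0][0] * u[1][1] - u[0][1] * u[1][0]
    u_inv = [[u[1][1] / det_u, -u[0][1] / det_u],
             [-u[1][0] / det_u, u[0][0] / det_u]]
    # second row of U^{-1}; U^{-1} is symmetric, so the (2,2) entry of the
    # sandwich is the quadratic form row * V * row'
    r = u_inv[1]
    return sum(r[i] * v[i][j] * r[j] for i in range(2) for j in range(2))

def closed_form_var_beta(f1, rho_y, p1, rho_x):
    """(1 + rho_y * rho_x) / (2 * sigma_x^2 * f0 * f1)."""
    var_x = p1 * (1 - p1)
    return (1 + rho_y * rho_x) / (2 * var_x * (1 - f1) * f1)

# Illustrative values
sand_a = sandwich_var_beta(0.35, 0.34, 0.3, 0.4)
form_a = closed_form_var_beta(0.35, 0.34, 0.3, 0.4)
sand_b = sandwich_var_beta(0.5, 0.2, 0.5, -0.5)
form_b = closed_form_var_beta(0.5, 0.2, 0.5, -0.5)
```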


4.3 Conditional Maximum Likelihood Estimator

With a variation of an argument used by Farewell (1979) we show that
the contribution to the asymptotic variance of $\hat\beta_c$ (when $\beta$ is
0), by a cluster with one $Y$ equal to 0 and the other $Y$ equal to 1, is

$$2/[\sigma_x^2(1 - \rho_x)].$$

The log likelihood of the conditional logistic model may be written

$$l(\beta) = \beta x_1 - \ln d,$$

where

$$d = \exp(\beta x_1) + \exp(\beta x_2).$$

Differentiating $l(\beta)$ twice, we obtain

$$\frac{\partial^2 l}{\partial\beta^2} = -x_1^2 f_1g_1 - x_2^2 f_2g_2 + 2x_1x_2f_1f_2,$$

where

$$g_i = 1 - f_i, \quad i = 1,2,$$

and $f_i$ is the conditional probability

$$P(Y_i = 1 \mid X_i = x_i), \quad i = 1,2.$$

Now when $\beta$ is 0, $f_i = g_i = 1/2$, $i = 1,2$, and we simply obtain

$$E\left(-\frac{\partial^2 l}{\partial\beta^2}\right) = \sigma_x^2(1 - \rho_x)/2. \tag{13}$$

This is the information in a single cluster in which there are dissimilar
values of $Y_1$ and $Y_2$. Clusters yielding similar values are discarded
(see Section 3.3). In a sample of $m$ clusters, on average, only a fraction
$2f_{01}$ of the clusters have dissimilar values of $Y_1$ and $Y_2$, and have
the information described in the right-hand side of (13). The remaining
$m(1 - 2f_{01})$ clusters have similar values of $Y$, and hence contain no
information about $\beta$. The variance of $\hat\beta_c$ in a sample of $m$
clusters containing both types of clusters is
$[m\sigma_x^2 f_{01}(1 - \rho_x)]^{-1}$.
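
The expectation in (13) can be checked by averaging the second derivative at
$\beta = 0$ over the symmetric joint distribution of $(X_1, X_2)$ from
Section 3.1. This is a sketch with hypothetical function names and
illustrative parameter values.

```python
def conditional_information(p1, rho_x):
    """Expected value of -d^2 l / d beta^2 at beta = 0 for one discordant cluster."""
    p0 = 1 - p1
    p_x = {(1, 1): p1 * (p1 + rho_x * p0),
           (0, 0): p0 * (p0 + rho_x * p1),
           (0, 1): p0 * p1 * (1 - rho_x),
           (1, 0): p0 * p1 * (1 - rho_x)}
    f = g = 0.5                                  # f_i = g_i = 1/2 when beta = 0
    return sum(p * (x1 ** 2 * f * g + x2 ** 2 * f * g - 2 * x1 * x2 * f * f)
               for (x1, x2), p in p_x.items())

def var_conditional(m, f01, p1, rho_x):
    """Large-sample variance of the conditional estimator in m clusters."""
    # on average a fraction 2 * f01 of the clusters is discordant in Y
    return 1.0 / (m * 2 * f01 * conditional_information(p1, rho_x))

info = conditional_information(0.3, 0.4)
closed = 0.3 * 0.7 * (1 - 0.4) / 2               # sigma_x^2 (1 - rho_x) / 2
var_c = var_conditional(100, 0.1, 0.5, 0.0)
```

With $m = 100$, $f_{01} = 0.1$, $p_1 = 1/2$ and $\rho_x = 0$ the variance is
$[100 \times 0.25 \times 0.1]^{-1} = 0.4$.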
4.4 Dummy Variables Estimator

In Section 3.2, it was shown that $\hat\beta_d$, the dummy variables
estimator of $\beta$, can be written as $2\ln(r_{11}/r_{10})$, that is, as
twice $\hat\beta_c$, the conditional estimator. Hence the asymptotic variance
of $\hat\beta_d$ is four times that of $\hat\beta_c$, that is, the
contribution to the information about $\beta$ from a single cluster with one
$Y$ equal to 0 and the other equal to 1 is $\sigma_x^2(1 - \rho_x)/8$, and
the large sample variance of $\hat\beta_d$ is
$4/[m\sigma_x^2 f_{01}(1 - \rho_x)]$.

5. COMPARISONS AND CORRECTIONS 
5.1 Asymptotic Relative Efficiency 

To examine the asymptotic relative efficiency of various estimators when
$\beta$ is 0, we consider the ratio of the asymptotic variance of the
unconditional maximum likelihood estimator to the asymptotic variance of the
other estimators. Let $\mathrm{re}(\hat\beta_u)$ denote the relative
efficiency of $\hat\beta_u$, the usual or uncorrelated logistic estimator. By
the formulae developed in Section 4,

$$\mathrm{re}(\hat\beta_u) = (1 + \rho_{y|x}\rho_x)^{-2}.$$

Figure 1 shows plots of $\mathrm{re}(\hat\beta_u)$ against $\rho_{y|x}$ for
several different non-negative values of $\rho_x$ and Figure 2 shows plots
for several different non-positive values of $\rho_x$. In general,
$\rho_{y|x}$ is constrained to be non-negative in Rosner’s model, but
$\rho_x$ may be either negative or positive. When $\rho_x$ is positive,
$\mathrm{re}(\hat\beta_u)$ is less than 1, but when $\rho_x$ is negative,
$\mathrm{re}(\hat\beta_u)$ is greater than 1. In this case $\hat\beta_r$, the
unconditional maximum likelihood estimator, is not fully efficient. This does
not violate the general principle of the optimality of the maximum likelihood
estimator for a single parameter, because we are dealing with a
multiparameter problem.

In most family studies both $\rho_{y|x}$ and $\rho_x$ are positive and small,
in particular, they are less than 0.5, so that in these types of studies, the
minimum value of $\mathrm{re}(\hat\beta_u)$ is 0.64. However, when both
$\rho_{y|x}$ and $\rho_x$ are quite small, say 0.1,
$\mathrm{re}(\hat\beta_u)$ is 0.99, and little efficiency is lost by
estimating $\beta$ with $\hat\beta_u$. Moreover, for $\rho_x = 0.1$,
$\mathrm{re}(\hat\beta_u)$ is always above 0.90.

But $\rho_x$ can be negative, for example, in designed studies where $\rho_x$
is equal to or close to $-1$. In this case,
$\mathrm{re}(\hat\beta_u) \to \infty$. However, for $\rho_x = -0.1$,
$\mathrm{re}(\hat\beta_u)$ never exceeds 1.21.

The asymptotic relative efficiency of $\hat\beta_c$, denoted by
$\mathrm{re}(\hat\beta_c)$, is

$$[f_{01}(1 - \rho_x)]/[2f_0f_1(1 + \rho_{y|x}\rho_x)]. \tag{14}$$

It can be shown (see Appendix 2) that $\mathrm{re}(\hat\beta_c)$ is always
less than or equal to 1, with equality coming only when $\rho_x$ is $-1$,
that is, when both $x$ values are dissimilar, one being 0 and the other being
1. Thus the conditional estimator of $\beta$ is only fully efficient in
designed studies, for example in case-control studies. See Figure 3 for a
plot of the asymptotic relative efficiency of $\hat\beta_c$ against
$\rho_{y|x}$ for several values of $\rho_x$. For $\rho_x = 0.0$,
$\mathrm{re}(\hat\beta_c)$ is at most 0.50 and falls to zero for increasing
values of $\rho_{y|x}$. Even for $\rho_x = -0.50$, the maximum value of
$\mathrm{re}(\hat\beta_c)$ is 0.75.



Figure 3. Asymptotic relative efficiency of $\hat\beta_c$ for various values
of $\rho_x$.

Finally, the relative efficiency of the dummy variables estimator, denoted
by $\mathrm{re}(\hat\beta_d)$, is

$$[f_{01}(1 - \rho_x)]/[8f_0f_1(1 + \rho_{y|x}\rho_x)],$$

that is, one-quarter of the asymptotic relative efficiency of $\hat\beta_c$.
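
The three relative efficiencies can be evaluated with a few lines of code.
The sketch below uses hypothetical function names; the helper
`cell_probs_null` is an assumption of this example and uses the identity
$f_{01} = f_0f_1(1 - \rho_{y|x})$, which follows from the definition of
$\rho_{y|x}$ when $\beta$ is 0.

```python
def re_usual(rho_y, rho_x):
    """Var(beta_r) / Var(beta_u) at beta = 0, from the Section 4 variances."""
    return (1 + rho_y * rho_x) ** -2

def re_conditional(f01, f0, f1, rho_y, rho_x):
    """Expression (14)."""
    return f01 * (1 - rho_x) / (2 * f0 * f1 * (1 + rho_y * rho_x))

def re_dummy(f01, f0, f1, rho_y, rho_x):
    """One-quarter of the relative efficiency of the conditional estimator."""
    return re_conditional(f01, f0, f1, rho_y, rho_x) / 4

def cell_probs_null(f1, rho_y):
    """(f01, f0, f1) for a given Pr(Y = 1) and intracluster correlation."""
    f0 = 1 - f1
    f01 = f0 * f1 * (1 - rho_y)     # Pr(Y1 = 0, Y2 = 1) when beta = 0
    return f01, f0, f1

re_u_half = re_usual(0.5, 0.5)
re_c_designed = re_conditional(*cell_probs_null(0.4, 0.3), 0.3, -1.0)
re_c_neg_half = re_conditional(*cell_probs_null(0.4, 0.0), 0.0, -0.5)
re_c_zero = re_conditional(*cell_probs_null(0.4, 0.0), 0.0, 0.0)
re_d_zero = re_dummy(*cell_probs_null(0.4, 0.0), 0.0, 0.0)
```

These values are 0.64, 1, 0.75, 0.50 and 0.125 respectively, in line with
the figures quoted in the text for $\mathrm{re}(\hat\beta_u)$ and
$\mathrm{re}(\hat\beta_c)$.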


5.2 Second-Order Properties 


When the wrong model, namely the usual or uncorrelated logistic model, 
is used, the estimated variance-covariance matrix is based on the inverse of 
the matrix 



(15)
where

m 

»=1 

As indicated in Section 4.2, as $m \to \infty$, this matrix approaches



By using the inverse of the matrix in (15), we are consistently estimating 

whereas we should be estimating $\mathrm{Var}(\hat\beta_u)$, which, by the
results of Section 4.2, is

$$(1 + \rho_{y|x}\rho_x)/(2m\sigma_x^2 f_0f_1).$$

This discrepancy corresponds exactly to that noticed for the continuous case
by Scott and Holt (1982, p. 850). The factor $(1 + \rho_{y|x}\rho_x)$ is
called the misspecification factor of the variance estimator. Thus for the
purpose of testing the hypothesis $\beta = 0$, one can use the output of a
standard computer program to get $\hat\beta_u$ and an estimate of
$\mathrm{Var}(\hat\beta_u)$, and then adjust by multiplying the estimate of
$\mathrm{Var}(\hat\beta_u)$ by $(1 + \hat\rho_{y|x}\hat\rho_x)$, where
$\hat\rho_x$ and $\hat\rho_{y|x}$ are any consistent estimates of the
intraclass correlation between the $x$’s and between the $y$’s; for example,
the usual pairwise correlation may be used. Conversely, at least in this
case, $(1 + \rho_{y|x}\rho_x)$ may be estimated by $C$, where $C$ was defined
by Brier (1980, p. 594). The approach in this paper gives a more general
interpretation to Brier’s parameter $C$ because $C - 1$, at least under the
null hypothesis, is the product of the intraclass correlations of the
dependent and independent variables.
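
The correction described above amounts to a one-line adjustment of the
output of a standard logistic regression program. The following sketch is
illustrative only: the function name and the numerical inputs are
hypothetical, and the two correlation estimates are assumed to have been
obtained separately (for example, as pairwise correlations).

```python
import math

def corrected_wald_z(beta_hat, naive_var, rho_x_hat, rho_y_hat):
    """Wald statistic for H0: beta = 0 using the misspecification factor.

    naive_var is the variance estimate reported by the usual logistic fit,
    which ignores clustering.
    """
    factor = 1.0 + rho_y_hat * rho_x_hat
    return beta_hat / math.sqrt(naive_var * factor)

# Hypothetical output of a standard program and correlation estimates
z_naive = 0.5 / math.sqrt(0.04)
z_corrected = corrected_wald_z(0.5, 0.04, rho_x_hat=0.5, rho_y_hat=0.5)
```

With both correlations positive the corrected statistic is smaller than the
naive one, since the naive variance underestimates the true variance.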


6. DISCUSSION 

When Rosner’s correlated logistic model provides the correct probabilis¬ 
tic mechanism for pairs of correlated binary outcomes in the presence of a 
single binary correlated covariate, three commonly used estimators of the 
regression coefficient have been shown, in general, to be lacking consistency 
and efficiency. 

1. The Breslow-Day conditional estimator, although consistent, may have 
very low asymptotic relative efficiency. 

2. The usual logistic estimator is consistent when the true regression
coefficient is zero, and in some other situations, and, when the regression
coefficient is zero, this estimator may be more efficient than the
unconditional maximum likelihood estimator.

3. The usual logistic model with dummy variables produces an estimator 
that is consistent only when the true regression coefficient is zero, and, 
in this case, has very low asymptotic relative efficiency. 

It would be of interest to see whether these conclusions hold for multiple 
covariates, continuous covariates, and mixtures of discrete and continuous 
covariates, and more than two units per cluster. 

This last point may be investigated using a version of the model in 
this paper expanded in a way suggested by Rosner (1983), that is, a model 
based on the beta-binomial with several units per cluster. However, of more 
interest is a model that would allow more complex correlation structure in 
logistic models, for example a model that allows second-order correlation 
different from Rosner’s (and hence the beta-binomial), or a model which 
allows unequal correlation between the units within a cluster. Such models 
would permit the modelling of quite complex patterns of correlated binary 
outcomes in the presence of covariates. 


APPENDIX 1: ASYMPTOTIC DISTRIBUTION UNDER 
AN INCORRECT MODEL 

The asymptotic distribution of an estimator, $\hat\theta$, in this case, the
maximum likelihood estimator, can be obtained by expanding the estimating
equations, $S_m(\hat\theta)$, in this case, the maximum likelihood equations,
in a multivariate Taylor series expansion about the correct value, $\theta$,
of the parameter (see Cox and Hinkley, 1974, p. 294-302). Hence

$$S_m(\hat\theta) = S_m(\theta) + S_m'(\theta)(\hat\theta - \theta), \tag{A1.1}$$

where $S_m'(\cdot)$ is the matrix of first derivatives of $S_m(\theta)$ with
respect to $\theta$. We are ignoring terms of $o_p(m^{-1/2})$. Since

$$S_m(\hat\theta) = 0,$$

(A1.1) becomes

$$S_m'(\theta)(\hat\theta - \theta) = -S_m(\theta),$$

or

$$\sqrt{m}\,(\hat\theta - \theta) = m[S_m'(\theta)]^{-1}[-S_m(\theta)/\sqrt{m}].$$

In the usual case where we have the correct model, by the multivariate
central limit theorem,

$$\frac{1}{\sqrt{m}}\,S_m(\theta) \xrightarrow{d} N_2(0, I),$$

where $I$ is the Fisher Information matrix. Moreover,

$$-\frac{1}{m}\,S_m'(\theta) \xrightarrow{p} I,$$

so that

$$\sqrt{m}\,(\hat\theta - \theta) \xrightarrow{d} N_2(0, I^{-1}).$$

However, in the case of an incorrectly specified probability distribution,

$$\hat\theta \xrightarrow{p} \theta_0,$$

where $\theta_0$ is usually not $\theta$, the value under the correct model,
so that we must examine the behaviour of

$$\sqrt{m}\,(\hat\theta - \theta_0) = m[S_m'(\theta_0)]^{-1}[-S_m(\theta_0)/\sqrt{m}] \tag{A1.2}$$

under the correct model.

If $E[S_m(\theta_0)] = 0$ and $\mathrm{Var}[S_m(\theta_0)] = V$, then, by the
multivariate central limit theorem,

$$\frac{1}{\sqrt{m}}\,S_m(\theta_0) \xrightarrow{d} N_2(0, V),$$

(where $V$ is not necessarily the Fisher Information matrix for either the
correct or the incorrect model). Now

$$-\frac{1}{m}\,S_m'(\theta_0) \xrightarrow{p} U$$

(where $U$ is also not necessarily the Fisher Information matrix), so that

$$\sqrt{m}\,(\hat\theta - \theta_0) \xrightarrow{d} N_2(0, U^{-1}VU^{-1}),$$

and the asymptotic variance of $\sqrt{m}\,\hat\theta$ is given by
$U^{-1}VU^{-1}$.


APPENDIX 2: ASYMPTOTIC RELATIVE EFFICIENCY 
OF THE CONDITIONAL ESTIMATOR 

The asymptotic relative efficiency of $\hat\beta_c$, as compared to $\hat\beta_r$,
is given by (14) in Section 5.1 as

$$[f_{01}(1 - \rho_x)]/[2 f_0 f_1 (1 + \rho_{y|x}\rho_x)].$$   (A2.1)

When $\beta$ is 0,

$$f_{01} = e^{\alpha_1}/d, \quad f_0 = (1 + e^{\alpha_1})/d, \quad f_1 = (e^{\alpha_1} + e^{\alpha_2})/d,$$

where

$$d = 1 + 2e^{\alpha_1} + e^{\alpha_2}.$$

Moreover, from (1) in Section 2, when $\beta$ is 0,

$$\rho_{y|x} = (e^{\alpha_2} - e^{2\alpha_1})/[(1 + e^{\alpha_1})(e^{\alpha_1} + e^{\alpha_2})].$$

Substituting into (A2.1), we have

$$re(\hat\beta_c) = \frac{e^{\alpha_1} d\,(1 - \rho_x)}{2[(1 + e^{\alpha_1})(e^{\alpha_1} + e^{\alpha_2}) + \rho_x(e^{\alpha_2} - e^{2\alpha_1})]}$$

$$= \frac{(e^{\alpha_1} + 2e^{2\alpha_1} + e^{\alpha_2+\alpha_1})(1 - \rho_x)}{2[e^{\alpha_1} + e^{2\alpha_1}(1 - \rho_x) + e^{\alpha_2+\alpha_1} + e^{\alpha_2}(1 + \rho_x)]}.$$   (A2.2)

When $\rho_x$ is $-1$, that is, when the two $x$ values are always different from
each other, then we have $re(\hat\beta_c) = 1$. For other values of $\rho_x$, we
compare the two estimators by subtracting the numerator of (A2.2) from its
denominator to get

$$e^{\alpha_1}(1 + \rho_x) + e^{\alpha_2+\alpha_1}(1 + \rho_x) + 2e^{\alpha_2}(1 + \rho_x)$$

$$= (1 + \rho_x)(e^{\alpha_1} + e^{\alpha_2+\alpha_1} + 2e^{\alpha_2}),$$

which is non-negative for all values of the parameters, and is 0 only when
$\rho_x$ is $-1$. Hence the conditional estimator, when $\beta$ is 0, is never more
efficient than the maximum likelihood estimator, and is as efficient as that
estimator only when $\rho_x$ is $-1$.
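
The algebra of this appendix can be checked numerically. The Python sketch below evaluates the efficiency once from the ingredients of (A2.1) and once from the closed form (A2.2). It assumes the reading of the symbols used here ($\alpha_1$, $\alpha_2$, $\rho_x$, with $\beta = 0$); the function names and the grid of test values are illustrative only.

```python
import math

def are_from_a21(alpha1, alpha2, rho_x):
    """Efficiency built from f01, f0, f1 and rho_{y|x}, as in (A2.1)."""
    e1, e2 = math.exp(alpha1), math.exp(alpha2)
    d = 1.0 + 2.0 * e1 + e2
    f01 = e1 / d
    f0 = (1.0 + e1) / d
    f1 = (e1 + e2) / d
    rho_y_given_x = (e2 - e1 ** 2) / ((1.0 + e1) * (e1 + e2))
    return f01 * (1.0 - rho_x) / (2.0 * f0 * f1 * (1.0 + rho_y_given_x * rho_x))

def are_from_a22(alpha1, alpha2, rho_x):
    """The same efficiency written in the closed form (A2.2)."""
    e1, e2 = math.exp(alpha1), math.exp(alpha2)
    num = (e1 + 2.0 * e1 ** 2 + e1 * e2) * (1.0 - rho_x)
    den = 2.0 * (e1 + e1 ** 2 * (1.0 - rho_x) + e1 * e2 + e2 * (1.0 + rho_x))
    return num / den

# Hypothetical parameter grid used only to exercise the two expressions.
grid = [(a1, a2, r) for a1 in (-1.0, 0.0, 0.7)
        for a2 in (-0.5, 0.3, 1.2)
        for r in (-1.0, -0.4, 0.0, 0.6)]
values = [(are_from_a21(*g), are_from_a22(*g)) for g in grid]
```

On this grid the two expressions agree, never exceed 1, and equal 1 when `rho_x` is -1, in line with the conclusion of the appendix.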
INFERENCES ABOUT INTERCLASS AND INTRACLASS
CORRELATIONS FROM FAMILIAL DATA 

ABSTRACT 

A brief review of estimation and tests of hypotheses concerning interclass 
and intraclass correlations from familial data is given. A modified likelihood 
ratio test is proposed for testing the equality of intraclass correlations in 
two multivariate normal populations. The asymptotic null distribution of 
the proposed test statistic is obtained. A test procedure based on Fisher’s 
z-transformation is also discussed. 


1. INTRODUCTION 

Estimation of the degree of familial resemblance concerning a certain
characteristic, such as blood pressure, stature, body weight, lung capacity
or I.Q., is of great importance in the biometrical study of inheritance.
Inferences are made on the basis of a sample of $N$ families represented by
$X_\alpha = (X_{0\alpha}, X_{1\alpha}, X_{2\alpha}, \ldots, X_{k_\alpha\alpha})'$, $\alpha = 1, 2, \ldots, N$,
each of which consists of the mother’s score $X_{0\alpha}$, in general the parent’s
score, and his or her $k_\alpha$ siblings’ scores $X_{1\alpha}, \ldots, X_{k_\alpha\alpha}$. This
implies that each family may have different numbers of siblings. We assume
that there is no difference among siblings with regard to the attribute under
consideration.

A model based on this familial data is constructed by assuming that $X_\alpha$
follows a $(k_\alpha + 1)$-variate normal distribution with mean vector

$$\mu_\alpha = (\mu_m, \mu_s e'_\alpha)'$$

and covariance matrix

$$\Sigma_\alpha = \begin{pmatrix} \sigma_m^2 & \sigma'_{ms\alpha} \\ \sigma_{ms\alpha} & \Sigma_{s\alpha} \end{pmatrix},$$   (1.1)

where $\sigma_{ms\alpha} = (\rho_{ms}\sigma_m\sigma_s, \ldots, \rho_{ms}\sigma_m\sigma_s)'$ is a $k_\alpha$-dimensional vector

and

$$\Sigma_{s\alpha} = \sigma_s^2[(1 - \rho_{ss})I_\alpha + \rho_{ss} e_\alpha e'_\alpha]$$

with $I_\alpha$ the identity matrix of order $k_\alpha$ and $e_\alpha = (1, 1, \ldots, 1)'$ a
$k_\alpha$-dimensional vector.

In the analysis of familial data there are two types of correlations: (i) a
parent-child correlation and (ii) a sib-sib correlation. The degree of
resemblance between a parent and siblings is measured by the interclass
correlation, $\rho_{ms}$, and the degree of resemblance among siblings by the
intraclass correlation, $\rho_{ss}$. Various kinds of estimators of the interclass
and intraclass correlations have been proposed, some of which were obtained
on a more or less intuitive basis while others are based on the method of
maximum likelihood. A brief survey of estimation and tests of hypotheses that
are relevant to interclass and intraclass correlations is given in Section 2.

Another object of the present paper is to consider a test of hypothesis
concerning intraclass correlations. In Section 3 a modified likelihood ratio
test statistic is proposed for testing the equality of intraclass correlations in
two multivariate normal populations. An improvement to a test procedure
based on Fisher’s z-transformation is also discussed. Large sample theory is
applied to obtain some results on distributions of the test statistics.




2. CORRELATIONS FROM FAMILIAL DATA 
2.1 Interclass Correlations

One approach to estimation is to use the method of maximum likelihood.
Unfortunately, an explicit solution of the likelihood equations does not exist
in the case where the number of siblings varies among families, so several
algorithms have been proposed for finding maximum likelihood estimates of
the interclass correlation $\rho_{ms}$.

An algorithm proposed by Rosner (1979) involves iterative maximization
of an implicit function of the parameters $\rho_{ms}$ and $\rho_{ss}$, and seems to be
difficult to implement. A simpler algorithm was proposed by Mak and Ng
(1981), using a linear model approach discussed by Kempthorne and Tandon
(1953). Srivastava (1984) gave an alternative algorithm that requires solving
only one equation, in which observations are transformed to simplify the
covariance structure in (1.1). For this type of transformation, see Olkin and
Pratt (1958) and Anderson (1963).

The disadvantage of the maximum likelihood estimates is that they cannot
be expressed in closed form. Hence, several estimators have been proposed
that use methods analogous to that of computing an ordinary product-moment
correlation. These are called the pairwise, sib-mean and random-sib
estimators, and were discussed in detail by Rosner et al. (1977). In
addition to these traditional estimators, Rosner et al. (1977) proposed the
so-called ensemble estimator obtained by calculating an expected value for
the random-sib estimator over all possible choices of random sibs from each
family.

These four estimators have been compared in terms of mean squared
error, using Monte Carlo simulation (Rosner et al., 1977) and applying large
sample theory (Konishi, 1982). This work showed that the pairwise and
ensemble estimators are far superior to the sib-mean and random-sib
estimators, and that the pairwise estimator is more effective than the ensemble
estimator at low values (< .3) of $\rho_{ss}$. Comparison of the estimators was
extended to include the maximum likelihood estimator by Rosner (1979); the
maximum likelihood estimator was shown to be roughly equivalent to the
pairwise estimator. It is known (Rosner, 1979) that the pairwise estimator
reduces to the maximum likelihood estimator in the case of an equal number
of siblings per family. Recently Srivastava (1984) gave two alternative sets
of estimators by adding a slight modification in the process of deriving
the maximum likelihood estimators.

Asymptotic distributions of the pairwise and ensemble estimators were
derived by Konishi (1982). He also showed that the sib-mean estimator is not
a consistent estimator for $\rho_{ms}$, and proposed a modified sib-mean estimator
which is consistent. The asymptotic variance of the maximum likelihood
estimator was derived by Rosner (1982).

Regarding the interclass correlation $\rho_{ms}$, it is of interest to test the
hypothesis $H_0: \rho_{ms} = 0$ versus $H_1: \rho_{ms} > 0$. The traditional
test procedures (the classical pairwise, conservative pairwise and sib-mean
tests) are based on the t-distribution with appropriate degrees of freedom.
These are analogous to the usual procedure for a correlation coefficient in
a bivariate normal sample (for details, see Rosner et al., 1979). However,
the problem of assessing the aggregate degrees of freedom over $N$ families
of varying sibship sizes $k_\alpha$ still remains. Rosner et al. (1979) suggested
improvements on the t-approximations of the classical and conservative test
statistics and gave the adjusted pairwise test. Konishi (1982, 1985b)
proposed alternative test procedures based on asymptotic distributions of the
estimators. These compared favorably with the previous test procedures
based on the studentized version of the pairwise estimator (see also Donner
and Bull, 1984).

It may also be of interest to test the hypothesis $H_0: \rho_{ms} = \rho_0$ versus
$H_1: \rho_{ms} \ne \rho_0$, where $\rho_0$ is a specified constant. For this testing problem,
Konishi (1985b) proposed test procedures based on large sample theory to
obtain asymptotic distributions of the pairwise and ensemble estimators.

Confidence intervals for interclass correlations can be constructed using 
the close relationship between tests of hypotheses and confidence sets (see 
Konishi, 1985b). 

2.2 Intraclass Correlations 

In the model for the familial data discussed in Section 1, we consider only
the scores of siblings, so that $X_\alpha = (X_{1\alpha}, X_{2\alpha}, \ldots, X_{k_\alpha\alpha})'$, $\alpha = 1, 2, \ldots, N$,
is assumed to have a $k_\alpha$-variate normal distribution with mean vector
$\mu_\alpha = (\mu_s, \mu_s, \ldots, \mu_s)'$ and covariance matrix $\Sigma_{s\alpha}$ with homogeneous
variances and covariances.

The maximum likelihood estimate of the intraclass correlation $\rho_{ss}$ also
cannot be expressed in a closed form. Algorithms for finding maximum
likelihood estimates have been given by Donner and Koval (1980) and Smith
(1980a,b). Donner and Koval (1980) compared the maximum likelihood
estimator with previous estimators, namely the analysis of variance and
pairwise intraclass correlations, using Monte Carlo simulation (also see Smith,
1957; Snedecor and Cochran, 1967). They suggested that if the value of
$\rho_{ss}$ is expected to be small or of moderate size (< .5), then the analysis of
variance intraclass correlation is effective, and if $\rho_{ss}$ is thought to be large
or if no prior knowledge about $\rho_{ss}$ exists, then the maximum likelihood
estimator is effective. It is known (Elston, 1975) that if the number of siblings
is the same in each family, then the pairwise intraclass correlation yields the
maximum likelihood estimates for $\rho_{ss}$.

Asymptotic distribution theory can be applied to test the hypotheses 
or to construct confidence intervals concerning the intraclass correlations. 
The results are messy, however, when the families have unequal numbers of 
siblings. 


3. A TEST FOR EQUALITY OF INTRACLASS CORRELATIONS 

We consider a test for the equality of intraclass correlations in two
multivariate normal populations. Let $X_1^{(\alpha)}, X_2^{(\alpha)}, \ldots, X_{N_\alpha}^{(\alpha)}$ be a random
sample of size $N_\alpha$ from a $p_\alpha$-variate normal distribution with mean vector
$\mu_\alpha = (\mu_\alpha, \mu_\alpha, \ldots, \mu_\alpha)'$ and covariance matrix $\Sigma_\alpha$ ($\alpha = 1, 2$). Assume that
$\Sigma_\alpha$ has homogeneous variances and covariances, so that

$$\Sigma_\alpha = \sigma_\alpha^2[(1 - \rho_\alpha)I_\alpha + \rho_\alpha e_\alpha e'_\alpha] \quad \text{for } \alpha = 1, 2.$$

On the basis of the observations, we wish to test the hypothesis

$$H_0: \rho_1 = \rho_2\ (= \rho)$$





where $\rho$ is unspecified. For related work see Hahn (1975) and Gupta and
Nagar (1986) and the references given there.

3.1 Modified Likelihood Ratio Test 

One possible test for the equality of two intraclass correlations is the 
likelihood ratio test. Donner and Bull (1983) discussed the likelihood ratio 
test, which involves, however, an iterative maximization of the likelihood 
function. 

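It is known (e.g., see Elston, 1975) that under the alternative hypothesis
$H_1$ the maximum likelihood estimates of $\sigma_\alpha^2$ and $\rho_\alpha$ for $\alpha = 1, 2$ are,
respectively, given by

$$\hat\sigma_\alpha^2 = p_\alpha^{-1} \sum_{i=1}^{p_\alpha} s_{ii}^{(\alpha)}$$

and

$$r_\alpha = \frac{\sum_{i \ne j} s_{ij}^{(\alpha)} / \{p_\alpha(p_\alpha - 1)\}}{\sum_{i=1}^{p_\alpha} s_{ii}^{(\alpha)} / p_\alpha},$$   (3.1)

where

$$S^{(\alpha)} = (s_{ij}^{(\alpha)}) = \frac{1}{N_\alpha} \sum_{i=1}^{N_\alpha} (X_i^{(\alpha)} - \bar{X}_\alpha)(X_i^{(\alpha)} - \bar{X}_\alpha)'$$   (3.2)

for $\bar{X}_\alpha = (\bar{x}_\alpha, \bar{x}_\alpha, \ldots, \bar{x}_\alpha)'$ with $\bar{x}_\alpha = \frac{1}{N_\alpha p_\alpha} \sum_{i=1}^{N_\alpha} \sum_{j=1}^{p_\alpha} x_{ij}^{(\alpha)}$.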

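Under the null hypothesis $H_0$ we must maximize the likelihood function
with respect to $\rho$, $\sigma_1^2$ and $\sigma_2^2$. It has been shown (Donner and Bull, 1983)
that the estimate of the common intraclass correlation $\rho$ is given as the root
of a cubic equation and must be solved numerically. Hence we propose a
modified likelihood ratio test statistic as follows:

$$-2 \log \Lambda = -2 \log L(\hat\omega) + 2 \log L(\hat\Omega)$$

$$= \sum_{\alpha=1}^{2} N_\alpha \left\{ p_\alpha \log \frac{\tilde\sigma_\alpha^2}{\hat\sigma_\alpha^2} + (p_\alpha - 1) \log \frac{1 - r}{1 - r_\alpha} + \log \frac{1 + (p_\alpha - 1) r}{1 + (p_\alpha - 1) r_\alpha} \right\},$$   (3.3)

where

$$r = \sum_{\alpha=1}^{2} N_\alpha p_\alpha (p_\alpha - 1) r_\alpha \Big/ \left\{ \sum_{\alpha=1}^{2} N_\alpha p_\alpha (p_\alpha - 1) \right\}$$   (3.4)

and

$$\tilde\sigma_\alpha^2 = \hat\sigma_\alpha^2 \left[ 1 - \frac{(p_\alpha - 1) r (r_\alpha - r)}{(1 - r)\{1 + (p_\alpha - 1) r\}} \right] \quad \text{for } \alpha = 1, 2.$$
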


Since the proposed test statistic does not, in general, have an asymptotic
$\chi^2$-distribution under the null hypothesis, we apply large sample theory to
derive the asymptotic distribution of the test statistic (3.3).

Let $N = N_1 + N_2$ and $f_\alpha = N_\alpha/N$ for $\alpha = 1, 2$. It can be shown that
$s_{ij}^{(\alpha)}$ $(i \ne j)$ and $s_{ii}^{(\alpha)}$ given by (3.2) converge to $\rho_\alpha$ and 1,
respectively, as $N$ tends to infinity. So we set

$$s_{ij}^{(\alpha)} = \rho_\alpha + \frac{1}{\sqrt{N_\alpha}} v_{ij}^{(\alpha)} \quad \text{for } i \ne j = 1, 2, \ldots, p_\alpha,$$

$$s_{ii}^{(\alpha)} = 1 + \frac{1}{\sqrt{N_\alpha}} v_{ii}^{(\alpha)} \quad \text{for } i = 1, 2, \ldots, p_\alpha.$$

Then we can expand $r_\alpha$ and $r$ in a power series of order $1/\sqrt{N}$ as


r a = Pc 




’ = Y. a «' , « + ~7 uY +iE + °r( N "'’I- 

a=l y ^ ot =1 a=l 

where $a_\alpha = f_\alpha p_\alpha(p_\alpha - 1)/\{\sum_{\alpha=1}^{2} f_\alpha p_\alpha(p_\alpha - 1)\}$ and

i Pa ~ p “ 

v (“) = _1_W a) - 

1 n. fn. - 1^ *3 n. ** ’ 


Pa{Pa ~ 1) ; 
/ Pa 

(a) _ Pa I V- (« 




1 Par Pa 

_i_ 

»alPa i; , =1 


Substituting (3.5) in the modified likelihood ratio test statistic given by 
(3.3) and expanding the result in a Taylor’s series, we have, under the null 
hypothesis, 

i 1 *) a jl ( (1) (2)\ 2 

~ 2108 A= Ftf $ a+(p - . - - iw H —) ■ 


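The variate $u_1^{(\alpha)}$ is a linear function of $v_{ij}^{(\alpha)}$ $(i, j = 1, 2, \ldots, p_\alpha)$ which are
asymptotically jointly normally distributed with zero means and covariances

$$E[v_{ii}^{(\alpha)2}] = 2, \quad E[v_{ij}^{(\alpha)2}] = 1 + \rho^2 \ (i \ne j),$$

$$E[v_{ii}^{(\alpha)} v_{ij}^{(\alpha)}] = 2\rho \ (i \ne j), \quad E[v_{ij}^{(\alpha)} v_{ik}^{(\alpha)}] = \rho(1 + \rho) \ (j \ne k),$$

$$E[v_{ij}^{(\alpha)} v_{kl}^{(\alpha)}] = 2\rho^2 \ (i \ne j \ne k \ne l).$$

Hence it can be shown that under $H_0$ the limiting distribution of
$u_1^{(1)} - u_1^{(2)}$, noting that $u_1^{(1)}$ and $u_1^{(2)}$ are mutually independent, is
normal with mean 0 and variance

$$\psi^2 = 2 \sum_{\alpha=1}^{2} \frac{(1 - \rho)^2 \{1 + (p_\alpha - 1)\rho\}^2}{f_\alpha p_\alpha (p_\alpha - 1)}.$$   (3.6)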

Then we have the following result. 

Theorem 3.1. Under the null hypothesis $H_0$, the asymptotic distribution
of the modified likelihood ratio test statistic $-2 \log \Lambda$ as $N = N_1 + N_2 \to +\infty$
is

$$\lim \mathcal{L}(-2 \log \Lambda) = c\chi_1^2,$$

where $\chi_1^2$ is a chi-squared variate with one degree of freedom and the
coefficient $c$ is given by


_ tp 2 fctPot{Pot 

C "2(WF^{1 + Cp" - l)p>* 

with $\psi^2$ given by (3.6) and $a_\alpha$ by (3.5).

In practice we have to estimate the unknown parameter $\rho$ included in
the coefficient $c$ by (3.4) or by some other procedure.

It is of interest to note that in the special case when $p_1 = p_2$, the
coefficient $c$ reduces to $1 - f_1^2 - f_2^2$, which is independent of the unknown
parameter. Then we have the following result.

Corollary 3.1. In the case when $p_1 = p_2$, the asymptotic null distribution
of $-2 \log \Lambda$ as $N \to +\infty$ is

$$\lim \mathcal{L}(-2 \log \Lambda) = (1 - f_1^2 - f_2^2)\chi_1^2,$$

where $f_\alpha = N_\alpha/(N_1 + N_2)$ for $\alpha = 1, 2$.

3.2 Fisher’s z-transformation 

In the particular case when $p_1 = p_2 = p$, say, we can use Fisher’s
z-transformation

$$Z_p(r_\alpha) = \sqrt{\frac{p - 1}{2p}} \log \frac{1 + (p - 1) r_\alpha}{1 - r_\alpha},$$   (3.7)

where $r_\alpha$ is the maximum likelihood estimator of $\rho_\alpha$ defined by (3.1). It
is known (Fisher, 1958, Chapter 7) that this transformation stabilizes the
variance for any value of dimension $p$, and that the asymptotic distribution
of $Z_p(r_\alpha)$ is normal with

$$\text{mean}: Z_p(\rho_\alpha) \quad \text{and variance}: \frac{1}{N_\alpha - 2}.$$   (3.8)

However this normal approximation becomes poorer for the values of $p$
greater than 2, since for $p \ge 3$ the variance stabilization and normalization
cannot be achieved simultaneously by the same transformation (3.7),
unlike the case of the correlation coefficient in a bivariate normal sample
(Konishi, 1981, 1985a).

It has been shown (Konishi, 1985a) that $Z_p(r_\alpha)$ is asymptotically
normally distributed with

$$\text{mean}: Z_p(\rho_\alpha) + \frac{7 - 5p}{N_\alpha \{18p(p - 1)\}^{1/2}} \quad \text{and variance}: \frac{1}{N_\alpha}.$$

This normal approximation is far superior to that of (3.8) and the second
term of the mean does not depend on the unknown parameter. Therefore,
for testing the equality of two intraclass correlations, we suggest using the
statistic

$$Z_p(r_1) - Z_p(r_2)$$

which, under the null hypothesis $H_0$, is asymptotically normally distributed
with

$$\text{mean}: \frac{7 - 5p}{\{18p(p - 1)\}^{1/2}} \left( \frac{1}{N_1} - \frac{1}{N_2} \right) \quad \text{and variance}: \frac{1}{N_1} + \frac{1}{N_2}.$$
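
The suggested procedure can be sketched in a few lines of Python. The sketch assumes the form of (3.7) read here as $Z_p(r) = \{(p-1)/(2p)\}^{1/2} \log[\{1 + (p-1)r\}/(1-r)]$, which reduces to Fisher's usual z for $p = 2$; that reading, the function names and the numerical inputs are assumptions of this illustration rather than details confirmed by the paper.

```python
import math

def z_p(r, p):
    """z-transformation of an intraclass correlation r for dimension p.

    Assumed form of (3.7); requires -1/(p - 1) < r < 1.
    """
    return math.sqrt((p - 1) / (2.0 * p)) * math.log((1 + (p - 1) * r) / (1 - r))

def z_test(r1, n1, r2, n2, p):
    """Standardized Z_p(r1) - Z_p(r2) and a two-sided normal p-value."""
    mean = (7 - 5 * p) / math.sqrt(18.0 * p * (p - 1)) * (1.0 / n1 - 1.0 / n2)
    var = 1.0 / n1 + 1.0 / n2
    stat = (z_p(r1, p) - z_p(r2, p) - mean) / math.sqrt(var)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value

# Hypothetical estimates from two samples of trivariate observations.
stat, p_value = z_test(0.45, 60, 0.20, 80, p=3)
```

With equal sample sizes the mean correction vanishes, so the statistic is then simply the difference of the two transformed estimates divided by $(2/N_1)^{1/2}$.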
MAXIMUM LIKELIHOOD ESTIMATION IN
QUANTAL RESPONSE BIOASSAY 

ABSTRACT 

Approximate confidence intervals for the LD50 and LD90 doses and for 
the relative potency of one drug with respect to another in Bioassay are 
usually obtained using the Fieller method. A procedure is presented in this 
paper based on transformations of the parameters of interest, designed to 
facilitate normal, t, or log F approximations to the observed maximized 
or profile likelihood function. Comparisons with the Fieller method and 
applications to specific data sets are discussed. The procedure proposed 
illustrates methods and principles of statistical estimation that have a wide 
applicability. 


1. INTRODUCTION 

1.1 Logistic Model and Parameters of Interest 

A model frequently used in Bioassay is the linear logistic binomial model
in which observations of a random count $Y$ are considered for which the
distribution

$$f(y; b_1, b_2) = \binom{k}{y} p^y (1 - p)^{k-y}, \quad y = 0, 1, \ldots, k,$$   (1.1)

is assumed, where $p = p(x)$ is given by

$$\log[p/(1 - p)] = b_1 + b_2 x.$$   (1.2)

The random variable $Y$ is the number of successes observed when an amount
$x$ of a drug, usually measured by the log concentration, is administered to $k$
individuals.

Two parameters of interest are the LD100$p_0$ dose ($0 < p_0 < 1$) and the
relative potency. The LD100$p_0$ dose is the value $x = \theta$ such that $p(\theta) = p_0$.
For two drugs such that

$$\log[p_1/(1 - p_1)] = b_1 + b_2 x, \quad \log[p_2/(1 - p_2)] = b_3 + b_2 x,$$   (1.3)

the relative potency of the second compared to the first is defined by
$p_2(x) = p_1(x + \gamma)$. From (1.2) and (1.3)

$$\theta = (t_0 - b_1)/b_2, \quad \gamma = (b_3 - b_1)/b_2,$$   (1.4)

where $t_0 = \log[p_0/(1 - p_0)]$. Two specific LD’s are of principal interest: the
LD50 and the LD90 obtained when $p_0 = 0.5, 0.9$ ($t_0 = 0, 2.1972$),
respectively.
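
The definitions in (1.4) translate directly into code. The following Python sketch computes the LD100$p_0$ dose and the relative potency from logistic coefficients; the function names and the coefficient values are hypothetical and serve only to illustrate the formulas.

```python
import math

def logit(p):
    """t0 = log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

def ld_dose(b1, b2, p0):
    """LD100p0 dose on the (log) dose scale: theta = (t0 - b1) / b2."""
    return (logit(p0) - b1) / b2

def relative_potency(b1, b2, b3):
    """gamma = (b3 - b1) / b2 for two parallel logistic lines."""
    return (b3 - b1) / b2

b1, b2, b3 = -6.0, 3.0, -4.5     # hypothetical coefficients
ld50 = ld_dose(b1, b2, 0.5)
ld90 = ld_dose(b1, b2, 0.9)
gamma = relative_potency(b1, b2, b3)
```

For any log dose $x$, the second line evaluated at $x$ and the first line evaluated at $x + \gamma$ then give the same logit, which is the defining property $p_2(x) = p_1(x + \gamma)$.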

Assumption (1.3) is equivalent to assuming that (log) dose $x$ of the second
drug produces the same effect as (log) dose $x + \gamma$ of the first drug, for all
$x$. In Bioassay experiments the first drug is called the standard preparation
while the second drug is the test preparation. In terms of the original doses
the above assumption is expressed by stating that the test preparation
behaves like a dilution/concentration of the standard preparation with dilution
factor $\exp\{\gamma\}$. Although this quantity is often called the relative potency
of the second drug with respect to the first, in this paper $\gamma$ will be referred
to as the relative potency.

The joint likelihood $L(b_1, b_2; y)$ of $(b_1, b_2)$ based on the numbers of
responses $y = (y_1, \ldots, y_A)$ observed on $A$ groups of sizes $k = (k_1, \ldots, k_A)$ to
which the respective doses $x = (x_1, \ldots, x_A)$ were administered, is given by

$$\log L(b_1, b_2; y) = \sum y_i \log p_i + \sum (k_i - y_i) \log(1 - p_i)$$   (1.5)

where

$$\log[p_i/(1 - p_i)] = b_1 + b_2 x_i.$$   (1.6)


1.2 Fieller Method 

Both the LD100$p_0$ dose and the relative potency involve ratios of
regression parameters. The usual procedure for estimating ratios is based on the
Fieller pivotal, discussed in some detail by Sprott and Viveros (1985). Its
use in Bioassay is discussed and exemplified by Finney (1971, 1978). Here,
however, the Fieller procedure is approximate, since none of the assumptions
required when applied to the normal model is satisfied.

Noting that the LD50 dose $\theta$ is defined by $b_1 + \theta b_2 = 0$, and that the
maximum likelihood (ML) estimate $(\hat b_1, \hat b_2)$ has an asymptotic bivariate
normal distribution about $(b_1, b_2)$ with estimated covariance matrix $V = I^{-1}$,
it follows that

$$Z = (\theta \hat b_2 + \hat b_1)/(\theta^2 v_{22} + 2\theta v_{12} + v_{11})^{1/2}$$   (1.7)

has an asymptotic standard normal distribution. Here, $I$ is the observed
information matrix with components

$$I_{ij} = -\partial^2 \log L(b_1, b_2; y)/\partial b_i \partial b_j, \quad (i, j = 1, 2).$$   (1.8)

Taking $Z$ as a standard normal variate, setting $Z^2$ equal to a specified
value and solving the resulting quadratic equation yields approximate
confidence intervals for $\theta$. This is the Fieller procedure for estimating the
confidence limits for the LD50 dose. Note that the denominator of (1.7) is a
random variable, whereas the exact use of the Fieller pivotal requires it be
a constant.
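
The Fieller procedure just described can be sketched in Python. The data are those quoted from Finney (1978) in Example 1 of Section 2.2; the Newton-Raphson fit, the function names and the choice u = 1.96 are assumptions of this sketch and not necessarily the computations used by the authors.

```python
import math

# Data quoted in Example 1 of the paper (Finney, 1978, p. 365).
x = [1.3863, 1.7918, 2.1972, 2.6027]   # log doses
k = [12, 24, 24, 10]                   # group sizes
y = [1, 8, 15, 8]                      # numbers of responses

def score_and_cov(b1, b2):
    """Score vector of (1.5) and V = I^{-1}, with I as in (1.8)."""
    p = [1.0 / (1.0 + math.exp(-(b1 + b2 * xi))) for xi in x]
    s1 = sum(yi - ki * pi for yi, ki, pi in zip(y, k, p))
    s2 = sum(xi * (yi - ki * pi) for xi, yi, ki, pi in zip(x, y, k, p))
    w = [ki * pi * (1.0 - pi) for ki, pi in zip(k, p)]
    i11 = sum(w)
    i12 = sum(wi * xi for wi, xi in zip(w, x))
    i22 = sum(wi * xi * xi for wi, xi in zip(w, x))
    det = i11 * i22 - i12 * i12
    return (s1, s2), (i22 / det, -i12 / det, i11 / det)   # (v11, v12, v22)

def fit(iterations=25):
    """Newton-Raphson iteration for the ML estimates of (b1, b2)."""
    b1 = b2 = 0.0
    for _ in range(iterations):
        (s1, s2), (v11, v12, v22) = score_and_cov(b1, b2)
        b1 += v11 * s1 + v12 * s2
        b2 += v12 * s1 + v22 * s2
    return b1, b2

def fieller_ld50(b1, b2, v, u=1.96):
    """Solve Z^2 = u^2 for theta, with Z as in (1.7).

    Returns None when the quadratic does not give a finite interval.
    """
    v11, v12, v22 = v
    a = b2 * b2 - u * u * v22
    b = b1 * b2 - u * u * v12
    c = b1 * b1 - u * u * v11
    disc = b * b - a * c
    if a <= 0 or disc < 0:
        return None
    root = math.sqrt(disc)
    return (-b - root) / a, (-b + root) / a

b1_hat, b2_hat = fit()
_, V = score_and_cov(b1_hat, b2_hat)
theta_hat = -b1_hat / b2_hat
interval = fieller_ld50(b1_hat, b2_hat, V)
```

The fitted values can be compared with the estimates quoted in Example 1, and the resulting interval with the likelihood-based intervals derived there.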

1.3 Maximum Likelihood 

The purpose here is to examine the direct application of maximum
likelihood estimation to make inferences about $\theta$, $\gamma$, valid and accurate in small
samples. This will be done by the application of the methods developed by
Sprott and Viveros (1984) and discussed in more detail for a related problem
by Sprott and Viveros (1985).

The analysis will be based on the maximized or profile likelihood for the
parameter of interest. Suppose the observations $y = (y_1, y_2, \ldots, y_A) \in S$,
$S$ being the sample space, produce the joint likelihood $L(\theta; y)$ of the vector
parameter $\theta = (\theta_1, \theta_2)$, where $\theta_1$ and $\theta_2$ may also be vectors. The associated
maximized or profile relative likelihood of $\theta_1$ is defined as

$$R_M(\theta_1; y) = L(\theta_1, \hat\theta_2(\theta_1); y)/L(\hat\theta_1, \hat\theta_2; y),$$   (1.9)

where $\hat\theta_2(\theta_1)$ is the maximum likelihood estimate of $\theta_2$ for specified $\theta_1$.
For more details, see Edwards (1972, p. 118) and Sprott (1980).

2. ESTIMATION OF THE LD50 DOSE

2.1 Maximized Likelihood and Transformations 

R. VIVEROS AND D. A. SPROTT

The asymptotic bivariate normality of $(\hat b_1, \hat b_2)$ would imply an asymptotic
log maximized relative likelihood function $\log R_M(\theta; y) = r_M(\theta; y) = -z^2/2$
for the LD50 dose $\theta$, which belongs to the family of likelihoods discussed by
Sprott and Viveros (1985), except that the denominator of $z$ here depends on
random variables. The skewness in $R_M(\theta; y)$ can be eliminated by the use of

$$\phi = \tan^{-1}[(\theta v_{22} + v_{12})/(v_{11}v_{22} - v_{12}^2)^{1/2}]$$   (2.1)

for which

$$F_i(\hat\phi) = [\partial^i \log R_M(\phi; y)/\partial\phi^i]_{\phi=\hat\phi} \big/ I_\phi^{i/2}$$   (2.2)

is zero for all $y \in S$ and $i = 3, 5, \ldots$, where $I_\phi = -[\partial^2 \log R_M(\phi; y)/\partial\phi^2]_{\phi=\hat\phi}$ is
the observed Fisher information of $\phi$ based on $R_M(\phi; y)$. Thus $R_M(\phi; y)$
will be a symmetric function of $\phi$ with respect to $\hat\phi$.

In terms of $\phi$, the Fieller pivotal (1.7) is

$$Z = I_\phi^{1/2} \sin(\phi - \hat\phi) \quad \text{where} \quad I_\phi = \hat b_2^2/[v_{22}\cos^2(\hat\phi)].$$   (2.3)

If $I_\phi$ is sufficiently large, $z \doteq u_\phi = I_\phi^{1/2}(\phi - \hat\phi)$ in the region of
non-negligible likelihood $|z| < 3$ and so $\phi$ has an approximate normal
likelihood centered at $\hat\phi$ and scaled by $I_\phi^{-1/2}$. By this is meant that
$R_M(\phi; y) \doteq \exp\{-u_\phi^2/2\} = \exp\{-I_\phi(\phi - \hat\phi)^2/2\}$.

Since, however, in finite samples $(\hat b_1, \hat b_2)$ does not have a normal
distribution, $R_M(\phi; y)$ is only an approximation to the maximized relative likelihood
of $\phi$ and $F_i(\hat\phi)$, $i = 3, 5, 7, \ldots$ will not be exactly zero in general. Thus the
observed $R_M(\phi; y)$ arising directly from (1.5) will have to be examined. This
can be done by writing the joint likelihood of $(b_1, b_2)$ as a function of $(\theta, b_2)$,
yielding a likelihood $L(\theta, b_2; y)$ proportional to (1.5) with (1.6) becoming

$$\log[p_i/(1 - p_i)] = b_2(x_i - \theta).$$   (2.4)

Then the corresponding $R_M(\theta; y)$ is

$$R_M(\theta; y) = L(\theta, \hat b_2(\theta); y)/L(\hat\theta, \hat b_2; y),$$   (2.5)

where $\hat b_2(\theta)$ is the ML estimate of $b_2$ for specified $\theta$.

The standardized third and fourth derivatives $F_3(\hat\phi)$ and $F_4(\hat\phi)$, which
are the coefficients associated with the powers 3 and 4 of the Taylor series
of $r_M(\phi; y)$ around $\hat\phi$, can be calculated numerically for any given sample
$y \in S$ using the results of Sprott (1980). These quantities are the basis of
the approximations discussed below.

As mentioned previously, both $\phi$ of (2.1) and the denominator of the
Fieller pivotal (1.7) arising from model (1.5) are random variables.
Although this is probably not a serious drawback, strictly speaking the logic of
the procedure requires them both to be fixed constants, as they were when
applied to the normal parent distribution by Sprott and Viveros (1985). To
avoid this in the case of (2.1), the transformation

$$\psi = \tan^{-1}[(w_{22}\theta + w_{12})/(w_{11}w_{22} - w_{12}^2)^{1/2}]$$   (2.6)

can be examined, where $W = (X'X)^{-1}$, $X$ being the $A \times 2$ matrix of 1’s in
the first column and the $x_i$’s in the second.

2.2 Inferences Using Normal Approximations 

It can be shown that if $F_3(\hat\phi)$ and $F_4(\hat\phi)$ are small for every $y \in S$
then the Taylor expansion about $\hat\phi$ of $r_M(\phi; y)$ coincides up to the fourth
power with that of the logarithm of the normal likelihood centered at $\hat\phi$ and
scaled by $I_\phi^{-1/2}$ (see Viveros, 1985). In other words, $u_\phi = I_\phi^{1/2}(\phi - \hat\phi)$ has an
approximate standard normal likelihood for every $y \in S$. Treating $U_\phi$ as
an approximate standard normal variate (Welch and Peers, 1963; Sprott,
1973), the information in the data about $\theta$ can be summarized in the form
of a complete set of approximate confidence intervals:

$$\phi = \hat\phi \pm u/I_\phi^{1/2}, \quad \theta = c_1 \tan[\hat\phi \pm u/I_\phi^{1/2}] + c_2,$$   (2.7)

where from (2.1) $c_1 = (v_{11}v_{22} - v_{12}^2)^{1/2}/v_{22}$, $c_2 = -v_{12}/v_{22}$, and $u$ is the
value of a $N(0, 1)$ variate required to produce the desired confidence level.
This is equivalent to linearising the Fieller pivotal as in Section 2.1.

Example 1. Finney (1978, p. 365) gives the numerical data $A = 4$,
$x = (1.3863, 1.7918, 2.1972, 2.6027)$, $k = (12, 24, 24, 10)$, $y = (1, 8, 15, 8)$. In this
case $\hat b_1 = -6.1775$, $\hat b_2 = 3.0124$, so that $\hat\theta = 2.0507$, and the calculation of
$V$ and $W$ leads to

$$\phi = \tan^{-1}(3.0959\,\theta - 6.2109), \quad \psi = \tan^{-1}(2.2059\,\theta - 4.3997),$$

from which

$$\hat\phi = 0.1358, \quad \hat\psi = 0.1225, \quad I_\phi = 13.0072, \quad I_\psi = 25.1578.$$

The quantities $F_3$ and $F_4$ are

             $\theta$      $\phi$       $\psi$
    $F_3$    0.1971    -0.0301     0.0507
    $F_4$    0.5818     0.0047     0.3034


Since $F_3(\hat\phi)$ and $F_4(\hat\phi)$ are small, $U_\phi = I_\phi^{1/2}(\phi - \hat\phi)$ can be taken as an
approximate standard normal variate in the proposed procedure. The summary of
the data is, using (2.7),

$$\phi = 0.1358 \pm 0.27727\,u, \quad \theta = 0.3230 \tan(0.1358 \pm 0.27727\,u) + 2.0062,$$

where $u$ is the required value of a standard normal variate.

The approximate 95% confidence interval obtained by setting $u = 1.96$
is $\phi = (-0.4076, 0.6792)$, $\theta = (1.866, 2.267)$. Inserting these values of $\phi$
into the Fieller pivotal (2.3) gives $z = \pm 1.86$, with a resulting confidence
coefficient of 0.937. However, this also is only approximate (see Section 5).
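
The arithmetic of Example 1 can be reproduced from the reported quantities. The Python sketch below evaluates (2.7) and then the normal probability that the pivotal (2.3) assigns to the resulting interval; the function names are hypothetical and the inputs are the rounded values printed in the example.

```python
import math

# Quantities reported in Example 1 for the transformation phi.
phi_hat = 0.1358
info_phi = 13.0072
c1, c2 = 0.3230, 2.0062

def normal_interval(phi_hat, info, c1, c2, u=1.96):
    """Interval (2.7) for phi and the implied interval for theta."""
    half = u / math.sqrt(info)
    phi = (phi_hat - half, phi_hat + half)
    theta = tuple(c1 * math.tan(v) + c2 for v in phi)
    return phi, theta

def fieller_coverage(phi, phi_hat, info):
    """Normal probability given by the pivotal (2.3) to a phi interval."""
    z = [math.sqrt(info) * math.sin(v - phi_hat) for v in phi]
    cdf = lambda t: 0.5 * math.erfc(-t / math.sqrt(2.0))
    return cdf(z[1]) - cdf(z[0])

phi_ci, theta_ci = normal_interval(phi_hat, info_phi, c1, c2)
coverage = fieller_coverage(phi_ci, phi_hat, info_phi)
```

Because the inputs are rounded, the last printed digit of the example can differ slightly from what this sketch returns.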

2.3 Inferences Using t Approximations 

If $F_4(\hat\phi)$ is not so small then a t distribution may be used for inferences
about $\phi$. Suppose that for every $y \in S$, $F_3(\hat\phi)$ is small and $0 < F_4(\hat\phi) < 6$,
and let

$$T = q^{1/2} U_\phi, \quad q = n/(n + 1), \quad n = [6/F_4(\hat\phi)] - 1.$$   (2.8)

Sprott (1982) showed that the Taylor series of $r_M(\phi; y)$ around $\hat\phi$ agrees up
to the fourth power with that of the $t_{(n)}$ likelihood centered at $\hat\phi$ and scaled
by $[qI_\phi]^{-1/2}$, that is, $R_M(\phi; y) \doteq \{1 + t^2/n\}^{-(n+1)/2}$ where
$t = q^{1/2}u_\phi = [qI_\phi]^{1/2}(\phi - \hat\phi)$ of (2.8) (see Viveros, 1985, for details).
Equivalently, the quantity $t$ has an approximate standard $t_{(n)}$ likelihood for
every $y \in S$. The inferences about $\phi$ (and hence about $\theta$) can be obtained by
treating $T$ as an approximate standard $t_{(n)}$ variate.

Example 2. In Example 1, $F_3(\hat\psi) = 0.0507$ but $F_4(\hat\psi) = 0.3034$ is not very
small. From (2.8) and the numerical results presented in Example 1, $n = 19$
and the resulting summary of the data is

$$\psi = 0.1225 \pm 0.2045\,t_{(19)}, \quad \theta = 0.4533 \tan[0.1225 \pm 0.2045\,t_{(19)}] + 1.9945.$$

The approximate 95% confidence interval obtained by setting $t_{(19)} = 2.093$
is $\psi = (-0.3056, 0.5506)$, $\theta = (1.851, 2.273)$, which is essentially the same
as that obtained in Example 1.
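
A short Python sketch of the computation in Example 2 follows. It takes the reported $F_4(\hat\psi)$, $I_\psi$, $\hat\psi$ and the constants of the tangent transformation as inputs; the function names are hypothetical, and the t quantile 2.093 is simply the value quoted in the example rather than something computed here.

```python
import math

def t_degrees_of_freedom(f4):
    """n = 6 / F4 - 1 from (2.8); requires 0 < f4 < 6."""
    return 6.0 / f4 - 1.0

def t_scale(n, info):
    """Scale [q * I]^{-1/2} with q = n / (n + 1)."""
    q = n / (n + 1.0)
    return 1.0 / math.sqrt(q * info)

n_raw = t_degrees_of_freedom(0.3034)   # about 18.8 before rounding
n = round(n_raw)                       # the example reports n = 19
scale = t_scale(n, 25.1578)
t_quantile = 2.093                     # value quoted in the example
psi_hat = 0.1225
psi_ci = (psi_hat - scale * t_quantile, psi_hat + scale * t_quantile)
theta_ci = tuple(0.4533 * math.tan(v) + 1.9945 for v in psi_ci)
```

Rounding $n$ to an integer is a convenience that allows standard t tables to be used; the sketch follows the value reported in the example.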


2.4 Inferences Using Log F Approximations 

Cases for which $F_3(\hat\phi)$ is not so small are indicative of residual skewness
in the likelihood of $\phi$. These may be handled by use of the skew log F
distribution. Suppose $3F_3^2(\hat\phi) + 2F_4(\hat\phi) > 0$, and let

$$H = q^{1/2} U_\phi, \quad q = 2/m + 2/n, \quad m = \frac{4}{c[c - F_3(\hat\phi)]}, \quad n = \frac{4}{c[c + F_3(\hat\phi)]},$$   (2.9)

where $c = [3F_3^2(\hat\phi) + 2F_4(\hat\phi)]^{1/2}$. Viveros (1985) showed that the Taylor series
of $r_M(\phi; y)$ about $\hat\phi$ agrees up to the fourth power with that of the
logarithm of the log $F_{(m,n)}$ likelihood centered at $\hat\phi$ and scaled by $[qI_\phi]^{-1/2}$, that
is, $R_M(\phi; y) \doteq \exp(mh/2)\{[1 + m/n]/[1 + (m/n)\exp(h)]\}^{(m+n)/2}$ where
$h = q^{1/2}u_\phi = [qI_\phi]^{1/2}(\phi - \hat\phi)$ as in (2.9). Equivalently, the quantity $h$ has
an approximate standard log $F_{(m,n)}$ likelihood for every $y \in S$. This
suggests treating $H$ as a standard log $F_{(m,n)}$ variate in obtaining approximate
inferences about $\phi$.

The form taken by the inferences will be

$$\phi = \hat\phi + [qI_\phi]^{-1/2}(h_1, h_2),$$

$$\theta = c_1 \tan\{\hat\phi + [qI_\phi]^{-1/2}(h_1, h_2)\} + c_2,$$   (2.10)

where $c_1$ and $c_2$ are as in (2.7), and $h_1$ and $h_2$ are usually selected according to
the criterion of equal probability tails or of highest probability density for the
log F distribution. The latter will be preferable since it automatically assures
the approximate likelihood property of the resulting confidence intervals,
that is, values of $\theta$ inside the interval will have likelihoods greater than or
equal to those of values outside the interval. These intervals will be referred
to as likelihood confidence intervals. Notice that the use of a symmetric
model, like normal or t, and the form of statement adopted in the previous
sections gives rise to approximate likelihood confidence intervals.

The use of the log F approximation will be illustrated with the estima¬ 
tion of the LD90 dose in Section 3. 


2.5 Remarks 

As seen in the examples presented, the confidence coefficients assigned
by the Fieller method to the intervals obtained by the procedure proposed
here tend to be smaller than the stated ones. However, the Fieller method
is also an approximation and, in fact, none of the assumptions required by
that procedure is satisfied in this case. Moreover, Jarret et al. (1984) show
that the Fieller method tends to produce wide confidence intervals in the
sense that the resulting exact coverage frequency is larger than the stated
one. These facts, together with the conformity of the intervals obtained in
reflecting the likelihood arising from the exact binomial logistic model, are a
good indication that the likelihood-based procedure proposed here may be
more accurate. A more thorough discussion is given in Section 5.

3. ESTIMATION OF THE LD90 DOSE 

The analysis in Section 2 can easily be extended to the estimation of the
LD90 dose. The approximate Fieller pivotal (1.7) becomes

$$Z = (\hat b_2\theta + \hat b_1 - t_0)/(\theta^2 v_{22} + 2\theta v_{12} + v_{11})^{1/2},$$   (3.1)

where $\theta$ now denotes the LD90 dose and $t_0 = 2.1972$; the transformations
(2.1) and (2.6) remain the same, where $V$ and $W$ are as defined in Section
2. As before, the maximum likelihood estimate $\hat\phi$ as well as the observed
Fisher information $I_\phi$ and the standardized coefficients $F_3(\hat\phi)$ and $F_4(\hat\phi)$ can
be calculated numerically; the normal, t or log F approximations can be
used in the same way as in Section 2. The corresponding calculations can
be performed for parameter $\psi$. Consider the following example.

Example 3. The following data arose from a medical experiment in which 
doses d = (2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0) mg/kg of the anaesthetic 
thiopentone were administered to 137 children allocated in A = 9 groups of 
sizes k = (15, 21, 20, 14, 12, 8, 11, 22,14) producing the respective numbers 
of responses y = (7, 14, 15, 13, 11, 5, 10, 22, 13). Each of these children was 
given a standard (fixed) dose of the drug TDP as premedication one and a 
half hours earlier, and the child was classified as responding if 20 seconds 
after the administration of thiopentone the eyelash reflex was abolished. 
These data were supplied by Brian Newman and are discussed, assuming a 
probit model, by Duncan et al. (1984). For more details see their paper. 

Fitting the binomial logistic model (1.1), (1.2), with n = logd<, the 
following results were obtained: 6i = -1.92005, b 2 = 2.78041, so that 6 = 
1.48082 [exp(0) = 4.3965]. The calculation of V and W yields 

<j> = tan -1 (3.0956 9 - 3.4135), * = tan -1 (2.8585 9 - 3.7977), 
from which 

4> = 0.8638, $ = 0.4105, = 35.7784, = 10.5663. 
The standardized coefficients $F_3$ and $F_4$ are

             $\theta$      $\phi$       $\psi$

  $F_3$      0.8553      -0.3188      0.0520

  $F_4$     -0.45998     -0.0536     -0.4882


Consider the transformation $\phi$. The application of (2.9) yields an
approximating log $F_{(m,n)}$ likelihood, centered at $\hat{\phi}$ and scaled by
$I_\phi$, to the maximized likelihood of $\phi$ with m = 12 and n = 72 df, and
q = (2/m + 2/n) = 0.1944. A plot of $R_M(\phi; y)$ and of the log $F_{(m,n)}$
likelihood, along with the normal likelihood centered at $\hat{\phi}$ and scaled
by $I_\phi$, is shown in Figure 1. The log F approximation appears satisfactory,
while some residual skewness in the maximized likelihood of $\phi$ makes the
normal approximation less appropriate.



Figure 1. Relative maximized likelihood $R_M(\phi; y)$ (solid line); log F
likelihood (small dashes); normal likelihood (large dashes).

The approximate inferences take the form

$\phi = 0.8638 + 0.3791\,(h_1, h_2)$,

$\theta = 0.32304 \tan[0.8638 + 0.3791\,(h_1, h_2)] + 1.1027$,

where $h_1$ and $h_2$ are values of a standard log $F_{(m,n)}$ variate required to
produce the desired confidence level. For instance, the values
$(h_1, h_2) = (-0.9859, 0.7998)$ lead to the approximate 95% likelihood confidence
interval $\phi = (0.4900, 1.1670)$, which in terms of $\theta$ is
$\theta = (1.2750, 1.8588)$ [$\exp(\theta) = (3.5787, 6.4160)$] (see end of
Section 2.4). It is interesting to note that the substitution of these values in
the (approximate) Fieller pivotal (3.1) gives $Z = (-2.1842, 1.7861)$, yielding a
confidence coefficient of 94.85% under the assumed asymptotic normality of $Z$.
Although the confidence coefficient is very close to the stated 95%, the use of
the Fieller pivotal written in terms of $\phi$ and its asymptotic normal
distribution will produce a symmetric ranking in $\phi$ and hence will not
reflect the skewness in the observed maximized likelihood.
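The arithmetic behind this interval is easy to retrace. The sketch below simply plugs the constants quoted in Example 3 into the stated transformation and evaluates the normal coverage of the quoted Fieller values; the helper names are illustrative and nothing here re-derives the log F quantiles.

```python
import math

phi_hat, scale = 0.8638, 0.3791      # centre and scale quoted in the text
h1, h2 = -0.9859, 0.7998             # quoted log F(12, 72) values for 95%


def theta_from_phi(phi):
    """Inverse transformation quoted in Example 3."""
    return 0.32304 * math.tan(phi) + 1.1027


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


phi_lo, phi_hi = phi_hat + scale * h1, phi_hat + scale * h2
theta_lo, theta_hi = theta_from_phi(phi_lo), theta_from_phi(phi_hi)
dose_lo, dose_hi = math.exp(theta_lo), math.exp(theta_hi)

# Coverage implied by the Fieller pivotal values quoted in the text.
coverage = norm_cdf(1.7861) - norm_cdf(-2.1842)

print(phi_lo, phi_hi, theta_lo, theta_hi, dose_lo, dose_hi, coverage)
```

Because the tangent is monotone on this range, the interval endpoints can be transformed one at a time, which is what makes the statement in $\phi$ directly usable for $\theta$ and for the dose itself.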


4. ESTIMATION OF THE RELATIVE POTENCY 
4.1 Fieller Method 

The estimation of the relative potency, although more complicated,
introduces nothing essentially new. Suppose the numbers of responses
$y_1 = (y_{11}, \ldots, y_{A1})$ are observed on A groups of respective sizes
$k_1 = (k_{11}, \ldots, k_{A1})$ to which doses $x_1 = (x_{11}, \ldots, x_{A1})$ of
the standard preparation are administered. Suppose, on the other hand, that when
doses $x_2 = (x_{12}, \ldots, x_{B2})$ of the test preparation are administered to
B groups of sizes $k_2 = (k_{12}, \ldots, k_{B2})$ the numbers of responses
$y_2 = (y_{12}, \ldots, y_{B2})$ are observed. The joint likelihood
$L(b_1, b_2, b_3) = L(b_1, b_2, b_3; y_1, y_2)$ of $(b_1, b_2, b_3)$ arising from
these data is

$\log L(b_1, b_2, b_3) = \sum_{i=1}^{A} \{ y_{i1} \log p_{i1} + (k_{i1} - y_{i1}) \log(1 - p_{i1}) \}$

$\qquad + \sum_{j=1}^{B} \{ y_{j2} \log p_{j2} + (k_{j2} - y_{j2}) \log(1 - p_{j2}) \}$,   (4.1)

where

$\log[p_{i1} / (1 - p_{i1})] = b_1 + b_2 x_{i1}$,

$\log[p_{j2} / (1 - p_{j2})] = b_3 + b_2 x_{j2}$.   (4.2)

A common way of estimating the relative potency (Finney, 1971, 1978) is
again by reference to the Fieller pivotal and its asymptotic normal
distribution arising from the asymptotic distribution of
$(\hat{b}_1, \hat{b}_2, \hat{b}_3)$, the ML estimate of $(b_1, b_2, b_3)$
calculated from (4.1). As is well known, $(\hat{b}_1, \hat{b}_2, \hat{b}_3)$ has
an asymptotic multivariate normal distribution about $(b_1, b_2, b_3)$ with
covariance matrix usually estimated by $V = I^{-1}$, where $I$ is the observed
Fisher information matrix with components

$I_{ij} = -\partial^2 \log L(b_1, b_2, b_3) / \partial b_i \partial b_j$,   (i, j = 1, 2, 3).   (4.3)

Noting that $\gamma$ is defined by the equation $b_1 + \gamma b_2 - b_3 = 0$, then
proceeding as in Section 1.2 the Fieller pivotal for $\gamma$ is

$Z = (\hat{b}_1 + \gamma\hat{b}_2 - \hat{b}_3) / (\gamma^2 A_2 + 2\gamma A_1 + A_0)^{1/2}$,   (4.4)

where $A_2 = v_{22}$, $A_1 = v_{12} - v_{23}$ and $A_0 = v_{11} - 2v_{13} + v_{33}$,
with asymptotic standard normal distribution.

4.2 Transformations 

The transformation that symmetrizes the maximized likelihood of $\gamma$
arising from the asymptotic multivariate normality of
$(\hat{b}_1, \hat{b}_2, \hat{b}_3)$ is

$\phi = \tan^{-1}[(A_2\gamma + A_1) / (A_0 A_2 - A_1^2)^{1/2}]$   (4.5)

with inverse

$\gamma = [(A_0 A_2 - A_1^2)^{1/2} / A_2] \tan\phi - [A_1 / A_2]$.   (4.6)

The Fieller pivotal in terms of $\phi$ is

$Z = I_\phi^{1/2} \sin(\phi - \hat{\phi})$, where $I_\phi = \hat{b}_2^2 / [A_2 \cos^2(\hat{\phi})]$.   (4.7)

As in Section 2 the functional specification of $\phi$ involves random quantities.
Although this is not of great concern, an alternative parameter that avoids
this feature is

$\psi = \tan^{-1}[(B_2\gamma + B_1) / (B_0 B_2 - B_1^2)^{1/2}]$,   (4.8)

where the $B_i$'s are as the $A_i$'s but calculated from the components of
$W = (X'X)^{-1}$.

Although the parameter $\psi$ does not symmetrize the asymptotic maximized
likelihood of $\gamma$, it can be as effective as $\phi$ in reducing the skewness
of the exact maximized likelihood of $\gamma$, arising from the exact linear
logistic model. This is discussed below.
4.3 Maximized Likelihood and Inferences Using Normal, t or Log F
Approximations

Writing $b_3$ in terms of $\gamma$, $b_1$, $b_2$ as $b_3 = b_1 + \gamma b_2$ and
substituting in (4.2) gives

$\log[p_{j2} / (1 - p_{j2})] = b_1 + b_2 (x_{j2} + \gamma)$.   (4.9)

Then the joint likelihood of $(\gamma, b_1, b_2)$ is proportional to (4.1) with
$p_{i1}$ obtained from (4.2) and $p_{j2}$ from (4.9). The relative maximized
likelihood of $\gamma$ is then

$R_M(\gamma; y) = L(\gamma, \tilde{b}_1, \tilde{b}_2) / L(\hat{\gamma}, \hat{b}_1, \hat{b}_2)$,   (4.10)

where $\tilde{b}_i = \tilde{b}_i(\gamma)$ is the ML estimate of $b_i$ for $\gamma$
specified, i = 1, 2.

Although (4.5) will not eliminate the skewness in $R_M(\phi; y)$, it will reduce
it considerably. The information $I_\phi$ based on $R_M(\phi; y)$ and the
standardized derivatives $F_3(\hat{\phi})$ and $F_4(\hat{\phi})$ can be calculated
numerically for every $(y_1, y_2)$ in the sample space, as before. Then the
normal, t, or log F approximations to $R_M(\phi; y)$ can be applied as in
Section 2, leading to approximate inferences about $\phi$ which, as usual, can
immediately be transformed into inferences about $\gamma$ by use of (4.6).

Example 4. Table 1 presents data given by Finney (1971, p. 108) arising from
laboratory tests run on mice involving the drugs Morphine (standard
preparation) and Phenadoxone (test preparation). These data are also discussed
by Kalbfleisch (1983, p. 42).


Table 1. Data from Finney (1971)

        Standard preparation              Test preparation
              Morphine                       Phenadoxone

$x_1$:    0.18    0.48    0.78      $x_2$:   -0.12    0.18    0.48

$k_1$:     103     120     123      $k_2$:      90      80      90

$y_1$:      19      53      83      $y_2$:      31      54      80


The maximization of (4.1) gives $\hat{b}_1 = -2.2583$, $\hat{b}_2 = 4.0102$ and
$\hat{b}_3 = -0.0359$, so that $\hat{\gamma} = 0.5542$. The calculation of $V$ and
$W$ leads to the transformations

$\phi = \tan^{-1}(2.1714\,\gamma - 0.8933)$,   $\psi = \tan^{-1}(2.0412\,\gamma - 0.6124)$,

from which

$\hat{\phi} = 0.3007$,   $\hat{\psi} = 0.4786$,   $I_\phi = 107.8179$,   $I_\psi = 167.5808$.

The values taken by the standardized coefficients $F_3$ and $F_4$ are

             $\gamma$      $\phi$       $\psi$

  $F_3$      0.1638      -0.0153     -0.0798

  $F_4$      0.0625       0.0149      0.0341

from which the use of a normal approximation to $R_M(\phi; y)$ seems
appropriate, as indicated by the small values of $F_3(\hat{\phi})$ and
$F_4(\hat{\phi})$. The form taken by the resulting inferences is

$\phi = 0.3007 \pm 0.0963u$,   $\gamma = 0.4608 \tan(0.3007 \pm 0.0963u) + 0.4114$,

where $u$ is the value of a standard normally distributed variate required
to produce the desired confidence coefficient approximately. For instance,
$u = 1.96$ gives the approximate 95% confidence interval
$\phi = (0.1119, 0.4894)$, $\gamma = (0.46297, 0.6568)$. Inserting these values in
the Fieller pivotal (4.7) gives $Z = (-1.9485, 1.9481)$, producing a confidence
coefficient of 94.86% under the asymptotic standard normal distribution. The
large size of the experimental groups makes the two methods agree in this case
and both are probably very accurate.

The use of $\psi$ and a normal approximation to $R_M(\psi; y)$ gives
approximately the same results.
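The quantities in Example 4 can be retraced from Table 1. The sketch below is an illustration under stated assumptions rather than the authors' code: it fits model (4.2) by Newton-Raphson, inverts the information matrix to obtain $V$, and forms $A_0$, $A_1$, $A_2$, the coefficients of (4.5) and the coefficient in (4.7). The first test log dose is taken as -0.12, and all function names are hypothetical.

```python
import math

# Table 1: log doses, group sizes, responses.
x1, k1, y1 = [0.18, 0.48, 0.78], [103, 120, 123], [19, 53, 83]   # Morphine
x2, k2, y2 = [-0.12, 0.18, 0.48], [90, 80, 90], [31, 54, 80]     # Phenadoxone

# Design rows for (b1, b2, b3): (1, x, 0) for standard, (0, x, 1) for test.
rows = [((1.0, x, 0.0), k, y) for x, k, y in zip(x1, k1, y1)]
rows += [((0.0, x, 1.0), k, y) for x, k, y in zip(x2, k2, y2)]


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(3):
        piv = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, 3):
            f = m[r][c] / m[c][c]
            for j in range(c, 4):
                m[r][j] -= f * m[c][j]
    z = [0.0] * 3
    for r in (2, 1, 0):
        z[r] = (m[r][3] - sum(m[r][j] * z[j] for j in range(r + 1, 3))) / m[r][r]
    return z


def score_and_information(b):
    """Score vector and observed information of log-likelihood (4.1)."""
    score = [0.0] * 3
    info = [[0.0] * 3 for _ in range(3)]
    for d, k, y in rows:
        p = 1.0 / (1.0 + math.exp(-sum(di * bi for di, bi in zip(d, b))))
        for i in range(3):
            score[i] += d[i] * (y - k * p)
            for j in range(3):
                info[i][j] += k * p * (1.0 - p) * d[i] * d[j]
    return score, info


b = [0.0, 0.0, 0.0]
for _ in range(50):
    score, info = score_and_information(b)
    step = solve3(info, score)
    b = [bi + si for bi, si in zip(b, step)]
    if max(abs(s) for s in step) < 1e-12:
        break
b1, b2, b3 = b
gamma_hat = (b3 - b1) / b2

# V = inverse information, built column by column.
_, info = score_and_information(b)
cols = [solve3(info, [float(i == j) for i in range(3)]) for j in range(3)]
v = [[cols[j][i] for j in range(3)] for i in range(3)]
A2 = v[1][1]
A1 = v[0][1] - v[1][2]
A0 = v[0][0] - 2.0 * v[0][2] + v[2][2]
root = math.sqrt(A0 * A2 - A1 * A1)
slope, intercept = A2 / root, A1 / root        # phi = atan(slope*gamma + intercept)
phi_hat = math.atan(slope * gamma_hat + intercept)
info_phi = b2 ** 2 / (A2 * math.cos(phi_hat) ** 2)   # coefficient in (4.7)

print(b1, b2, b3, gamma_hat)
print(slope, intercept, phi_hat, info_phi)
```

If the assumptions hold, the output should be close to the estimates, the transformation coefficients and the information value quoted in the example.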

5. ACCURACY OF THE PROCEDURE PROPOSED 

The calculation of the exact coverage frequency associated with the
approximate confidence intervals for the LD50 dose obtained by the procedure
proposed here, or in fact by any other procedure, is difficult when four or
more doses are used in the experiment. Simulations of the respective binomial
distributions, however, can be performed to check in a limited way the
accuracy of the foregoing procedures.

Consider, for instance, the case A = 4 groups with $x = (-3, -1, 1, 3)$,
$k = (10, 10, 10, 10)$. Take for the binomial logistic coefficients the values
$b_1 = -0.5493$, $b_2 = 0.5493$, the ML estimates arising from the observations
$y = (1, 1, 8, 6)$ (see Kalbfleisch, 1983, p. 38), so that the LD50 dose is
$\theta = 1$. Thus the respective success probabilities for the four groups
calculated from (1.6) are $p = (1/10, 1/4, 1/2, 3/4)$.

The results based on 2,000 independent samples generated on the computer
using the IMSL subroutine GGBN are presented in Table 2. For each sample the
approximate 90%, 95% and 99% confidence intervals for $\theta$ were computed
using the Fieller method, and using parameters $\phi$ and $\psi$ and the
method proposed here. Then it was determined whether or not the value
$\theta = 1$ lay inside the interval, and the overall percentage of times in
which $\theta = 1$ lay in the interval was recorded for each of the three
confidence coefficients.


Table 2

Nominal          Fieller       Use of       Use of
Coefficient      Method        $\phi$       $\psi$

90%              90.79%        88.94%       90.29%

95%              96.10%        94.24%       95.30%

99%              99.65%        98.95%       99.20%


These results reinforce the above mentioned feature that the Fieller
method tends to underestimate the precision of the inferences. On the basis
of Table 2, the method presented here appears to overestimate it somewhat.
However, the inferences based on parameters $\phi$ and $\psi$ show a coverage
frequency globally somewhat closer to the stated one than the coverage
frequency associated with the Fieller method. In fact, the application of the
likelihood ratio test for the testing of

$H_0$: coverage frequency = 0.90, 0.95, 0.99,

based on n = 2,000 trials and the observed frequencies shown in Table 2
indicates that the observed coverage frequencies arising from the use of $\phi$
and $\psi$ do not differ significantly from 0.95 and 0.99 at the 5% level, while
the Fieller coverage frequency does. The approximate p-values associated with
the observed Fieller coverage frequencies calculated from the likelihood ratio
test and its approximate chi-squared distribution with 1 degree of freedom
are p = 0.019 and 0.0008 for the second and third rows respectively; the
corresponding approximate p-values for the observed coverage frequencies
based on $\phi$ and on $\psi$ are p = 0.127 and 0.534 for the second row, and
p = 0.83 and 0.35 for the third row. The approximate p-values arising from
testing the first row are p = 0.23, 0.12, and 0.65, so that none of the three
methods produces a coverage frequency differing significantly from 0.90.
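The p-values in this paragraph follow from a one-parameter binomial likelihood ratio test. A minimal sketch is given below; it assumes the observed percentage is applied directly to n = 2,000 trials and uses the identity $P(\chi^2_1 > s) = \mathrm{erfc}(\sqrt{s/2})$. The function name is illustrative.

```python
import math


def coverage_lr_pvalue(p_obs, p0, n=2000):
    """LR test of H0: coverage = p0, given observed coverage p_obs in n trials."""
    x = p_obs * n                      # (possibly fractional) number of successes
    stat = 2.0 * (x * math.log(p_obs / p0)
                  + (n - x) * math.log((1.0 - p_obs) / (1.0 - p0)))
    stat = max(stat, 0.0)              # guard against rounding below zero
    # Upper tail of chi-squared with 1 df.
    return math.erfc(math.sqrt(stat / 2.0))


p_fieller_95 = coverage_lr_pvalue(0.9610, 0.95)
p_phi_95 = coverage_lr_pvalue(0.9424, 0.95)
p_psi_95 = coverage_lr_pvalue(0.9530, 0.95)
print(p_fieller_95, p_phi_95, p_psi_95)
```

For the second row of Table 2 this reproduces, to the precision quoted, the contrast drawn in the text: the Fieller coverage differs significantly from 0.95 while the coverages based on the two transformed parameters do not.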

Thus, although all of the coverage frequencies may appear close to the
required values, the Fieller frequencies differ significantly from these at the
higher levels of .95 and .99. Further, it should be again emphasized that
the intervals based on $\phi$ and on $\psi$ take account of the asymmetry of the
observed likelihoods of $\phi$ and of $\psi$, while the Fieller intervals do not.

6. DISCUSSION

The analysis in this paper, like that by Sprott and Viveros (1984, 1985),
is based on transformations of the parameter of interest for which normal, t
or log F approximations to the observed relevant likelihood are facilitated.
When normal or t approximations are appropriate, the result is a complete
summary of the evidence about $\phi$ in the data in the simple form
$\phi = \hat{\phi} \pm st$, where $\hat{\phi}$ is the maximum likelihood estimate
of $\phi$, $s$ is related to its precision, and $t$ is the value of a standard
normal or standard t variate required to produce the confidence level
approximately. When this is not possible, because of residual skewness in the
likelihood, the use of $\phi = \hat{\phi} + s(h_1, h_2)$ is illustrated, where
$h_1$ and $h_2$ are appropriate values of a standard log F variate. These
statements are transformed into corresponding statements about the parameter
of interest by straight algebraic substitution.

This form of statement is put forward by Barnard (1985) as the object 
of estimation in preference to the widespread notion of estimation as an 
attempt to seek single numbers which are “closest” in some sense to the true 
value of the parameter. Because of the conformity of the approximate pivotal 
used here with the relevant likelihood, the resulting statements will reflect 
the shape of the observed likelihood function. Thus the numerical accuracy 
is combined with the likelihood property in the resulting inferences. 

More remains to be investigated. In particular, the determination of the 
exact coverage frequency produced by the estimation procedure presented is 
of interest, but due to the amount of programming involved it has been left 
for further consideration. 
ESTIMATION METHODS FOR SYMMETRIC
PARABOLIC BIOASSAYS

ABSTRACT 

Two univariate methods of estimation in parabolic bioassays are reviewed,
and a new test for the assumption of “parallelism” is proposed.
One method is extended into a multivariate situation based on a general
linear model which generates a point estimator and approximate confidence
limits for the associated relative potency.


1. INTRODUCTION 

With linear response curves, parallel-line assays can be routinely analyzed
to estimate $\mu$, the log relative potency of a test preparation and a
standard preparation (see Finney, 1978 or Hubert, 1984). Even in a
multivariate situation, an explicit estimator of $\mu$ exists when the response
curves are linear (Carter and Hubert, 1985).

However, occasionally significant curvature persists in a parallel-line
assay, even after restricting the dosage range and making suitable
transformations (Bliss, 1957; Elston, 1965). Often, a parabolic (quadratic) model
will adequately fit the data over a limited dosage range (Bliss, 1957; Elston,
1965).

In this paper we concentrate only on the symmetric situation of parabolic
response curves (for the general situation see Carter et al., 1986). Note that
these are dilution assays, implying that both parabolas are identical except
that one is merely shifted horizontally with respect to the other. Therefore,
the horizontal distance between the two parabolas is constant. It is this
constant horizontal distance which is $\mu$, the parameter of most interest.

First, the estimation of $\mu$ by a weighted mean in the univariate case
(Elston, 1965) will be presented and then expanded to the multivariate
situation. Model testing procedures will also be suggested. Next, estimation
of $\mu$ by the method of maximum likelihood in the univariate case (Williams,
1973) will be given, along with a test for the horizontal “parallelism” of the
parabolas.

1 Department of Mathematics and Statistics, University of Guelph, Guelph,
Ontario N1G 2W1 (all authors)


2. THE UNIVARIATE PARABOLIC MODEL

Let $y$ denote the response variable and $x$ (= log dose), the dose metameter
scaled such that $\sum x_i = 0$; let $\alpha$, $\beta$ and $\gamma$ denote the
model parameters in the assumed quadratic model for the standard preparation

$y_s = \alpha_s + \beta_s x + \gamma_s x^2 + \epsilon_s$   (2.1)

and for the test preparation

$y_t = \alpha_s + \beta_s (x + \mu) + \gamma_s (x + \mu)^2 + \epsilon_t$

$\quad\; = \alpha_t + \beta_t x + \gamma_t x^2 + \epsilon_t$,   (2.2)

where:

$\gamma_t = \gamma_s = \gamma$,

$\beta_t = \beta_s + 2\mu\gamma$,   (2.3)

$\alpha_t = \alpha_s + \mu\beta_s + \mu^2\gamma$,   (2.4)

and the $\epsilon$'s are $N(0, \sigma^2)$ and are the error components.

The equality of $\gamma_s$ and $\gamma_t$ can easily be tested using regular
regression techniques.

With the reparameterizations

$\alpha = \alpha_s + \alpha_t$,   $\delta_1 = \alpha_t - \alpha_s$,   $\beta_1 = (\beta_s + \beta_t)/2$,
$\delta_2 = (\beta_t - \beta_s)/2$,   $\beta_2 = \gamma$,

the restrictions (2.3) and (2.4) can be written as

$\delta_1 = \mu\beta_1$ and $\delta_2 = \mu\beta_2$

and a general linear model in matrix notation will be

$Y = \Psi X + \epsilon$,   (2.5)
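where:

$Y = [\, y_{s1} \;\; y_{s2} \;\; \cdots \;\; y_{sn} \;\; y_{t1} \;\; y_{t2} \;\; \cdots \;\; y_{tn} \,]$,

$\Psi = [\, \alpha \;\; \delta_1 \;\; \beta_1 \;\; \delta_2 \;\; \beta_2 \,]$,

$X = \begin{bmatrix}
0.5 & 0.5 & \cdots & 0.5 & 0.5 & 0.5 & \cdots & 0.5 \\
-0.5 & -0.5 & \cdots & -0.5 & 0.5 & 0.5 & \cdots & 0.5 \\
x_1 & x_2 & \cdots & x_n & x_1 & x_2 & \cdots & x_n \\
-x_1 & -x_2 & \cdots & -x_n & x_1 & x_2 & \cdots & x_n \\
x_1^2 & x_2^2 & \cdots & x_n^2 & x_1^2 & x_2^2 & \cdots & x_n^2
\end{bmatrix}$,

$\epsilon$ is a row vector of error terms which are $N(0, \sigma^2)$.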


3. ESTIMATION BY WEIGHTED MEAN (ELSTON, 1965) 

For symmetric parabolic assays, Elston (1965) suggested the following
weighted mean estimator for the log relative potency parameter $\mu$. An
estimator of $\Psi$ is

$\hat{\Psi} = YX'(XX')^{-1} = [\, a \;\; d_1 \;\; b_1 \;\; d_2 \;\; b_2 \,]$   (3.1)

and two independent estimators of $\mu$ are

$m_1 = d_1/b_1$ and $m_2 = d_2/b_2$.   (3.2)

Taking the inverses of the asymptotic variances of these estimators, the
following weights are obtained:

$w_1 = b_1^2 / (a_1 + m_1^2 a_2)$,

$w_2 = b_2^2 / (a_3 + m_2^2 a_4)$,

where the $a$'s are elements from $A = \mathrm{diag}(a_1 \; a_2 \; a_3 \; a_4)$, which
is the lower 4 × 4 matrix of $(XX')^{-1}$. Therefore, the weighted mean of these
two estimators is

$m = (w_1 m_1 + w_2 m_2)/(w_1 + w_2)$.   (3.3)

The weights should be recalculated using $m$ in place of $m_1$ and $m_2$, and
$m$ and the weights are then recalculated iteratively until stability is
reached. The variance of $m$ is estimated by

$V(m) = V / [f(w_1 + w_2)]$,   (3.4)

where

$V = YY' - YX'(XX')^{-1}XY'$

and $f$ is the degrees of freedom for error.

An asymptotic test for the validity of the two restrictions (2.3) and (2.4)
is simply a two-sample z-test for the significance of the difference between
$m_1$ and $m_2$, since the asymptotic distribution of these statistics is normal.
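The iteration in (3.3) is short enough to write out. The sketch below is illustrative only: it takes the estimates $d_1, b_1, d_2, b_2$ and the diagonal elements $a_1, \ldots, a_4$ as given inputs, uses weights $w_1 = b_1^2/(a_1 + m^2 a_2)$ and $w_2 = b_2^2/(a_3 + m^2 a_4)$, and the numerical inputs in the usage lines are hypothetical.

```python
def weighted_mean_potency(d1, b1, d2, b2, a, tol=1e-12, max_iter=100):
    """Iterated weighted mean of m1 = d1/b1 and m2 = d2/b2, as in (3.3)."""
    a1, a2, a3, a4 = a
    m1, m2 = d1 / b1, d2 / b2
    # Initial weights use the two separate estimates.
    w1 = b1 ** 2 / (a1 + m1 ** 2 * a2)
    w2 = b2 ** 2 / (a3 + m2 ** 2 * a4)
    m = (w1 * m1 + w2 * m2) / (w1 + w2)
    for _ in range(max_iter):
        # Recompute the weights with the pooled estimate m in place of m1, m2.
        w1 = b1 ** 2 / (a1 + m ** 2 * a2)
        w2 = b2 ** 2 / (a3 + m ** 2 * a4)
        m_new = (w1 * m1 + w2 * m2) / (w1 + w2)
        if abs(m_new - m) < tol:
            return m_new, w1, w2
        m = m_new
    return m, w1, w2


# Hypothetical inputs: when m1 == m2 the pooled estimate must equal them.
m_equal, _, _ = weighted_mean_potency(1.0, 2.0, 3.0, 6.0, (0.1, 0.2, 0.3, 0.4))
# When they differ, the pooled estimate lies between m1 = 0.5 and m2 = 0.4.
m_mixed, w1, w2 = weighted_mean_potency(1.0, 2.0, 2.0, 5.0, (0.1, 0.2, 0.3, 0.4))
print(m_equal, m_mixed)
```

Given the error sum of squares $V$ and its degrees of freedom $f$, the variance estimate (3.4) then follows as `V / (f * (w1 + w2))`.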

4. A MULTIVARIATE MODEL AND MODEL TESTING 

Following the procedure of Carter and Hubert (1985), the estimators $m_1$
and $m_2$ can be cast into a multivariate setting, where p characteristics are
measured for each experimental unit.

As in the univariate case, the equality of the quadratic regression
parameters must be tested. This is done by using Hotelling’s $T^2$ procedure
(see Srivastava and Carter, 1983).

Now the general linear model is 

$Y = \Psi X + \epsilon$,   (4.1)

where:

$Y = (y_{s1} \;\; y_{s2} \;\; \cdots \;\; y_{sn} \;\; y_{t1} \;\; y_{t2} \;\; \cdots \;\; y_{tn})$, where
each $y$ is a p-vector of responses for the p characteristics,

$\Psi = [\, \alpha \;\; \delta_1 \;\; \beta_1 \;\; \delta_2 \;\; \beta_2 \,]$,

$X$ is defined in (2.5),

$\epsilon$ is a matrix of error terms which are $N_p(0, \Sigma)$.

To test if $\mu$ is constant over the p characteristics (i.e.,
$\delta_1 = \mu\beta_1$ and $\delta_2 = \mu\beta_2$), first calculate the 4 × 4
matrix

$G = T'\Theta'(fS)^{-1}\Theta T$,

where

$\Theta = (\hat{\delta}_1, \hat{\beta}_1, \hat{\delta}_2, \hat{\beta}_2)$

and

$T = \mathrm{diag}[t_1 \; t_2 \; t_3 \; t_4] = \mathrm{diag}[a_1^{-1/2} \; a_2^{-1/2} \; a_3^{-1/2} \; a_4^{-1/2}]$.

Then from Chou and Muirhead (1979), reject the null hypothesis
$H: \delta_1 = \mu\beta_1$ if

$[f - (p - 1)/2 + \ell_1^{-1}]\,[\ln(1 + \ell_2)] > \chi^2_{p-1;\alpha}$,

where $\ell_1 > \ell_2$ are the eigenvalues of the 2 × 2 upper left matrix of $G$.
Similarly, we reject the null hypothesis $H: \delta_2 = \mu\beta_2$ if

$[f - (p - 1)/2 + \ell_3^{-1}]\,[\ln(1 + \ell_4)] > \chi^2_{p-1;\alpha}$,   (4.2)

where $\ell_3 > \ell_4$ are the eigenvalues of the 2 × 2 lower right matrix of $G$.
If $\mu$ is found not to be constant over the p characteristics then univariate
procedures need to be done for each characteristic. However, if $\mu$ is constant
over the p characteristics then it can be estimated as follows.

5. MULTIVARIATE ESTIMATION BY WEIGHTED MEAN 
From Kraft et al. (1972), two explicit estimators of $\mu$ are

$m_1 = (t_2/t_1) \{ g_{11} - g_{22} + [(g_{11} - g_{22})^2 + 4g_{12}^2]^{1/2} \} / (2g_{12})$

and

$m_2 = (t_4/t_3) \{ g_{33} - g_{44} + [(g_{33} - g_{44})^2 + 4g_{34}^2]^{1/2} \} / (2g_{34})$.

Using the inverses of the asymptotic variances of these two estimators the
corresponding weights are

$w_1 = f g_{22} / (m_1^2 + a_1/a_2)$

and

$w_2 = f g_{44} / (m_2^2 + a_3/a_4)$

and an overall estimator of $\mu$ is given by the weighted mean

$m = (w_1 m_1 + w_2 m_2) / (w_1 + w_2)$.   (5.1)

The weights and $m$ are iteratively calculated as in Section 3 until stability
is reached. An approximate standard error of $m$ is given by

$SE(m) = (w_1 + w_2)^{-1/2}$,

and an asymptotic confidence interval for $\mu$ is

$m \pm z_{\alpha/2}\,SE(m)$,

where $z_{\alpha/2}$ is the $(1 - \alpha/2) \times 100$ percentile of a standard
normal distribution.
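A quick sanity check on the explicit estimators is that, with a single characteristic (p = 1), they should collapse to the univariate ratios $d_1/b_1$ and $d_2/b_2$ of Section 3. The sketch below verifies this numerically; it assumes the form $(t_b/t_a)\{g_{aa} - g_{bb} + [(g_{aa} - g_{bb})^2 + 4g_{ab}^2]^{1/2}\}/(2g_{ab})$, and the function name and numbers are hypothetical.

```python
import math


def explicit_estimator(g_aa, g_bb, g_ab, t_a, t_b):
    """(t_b/t_a) * {g_aa - g_bb + sqrt((g_aa - g_bb)^2 + 4 g_ab^2)} / (2 g_ab)."""
    diff = g_aa - g_bb
    return (t_b / t_a) * (diff + math.sqrt(diff * diff + 4.0 * g_ab * g_ab)) / (2.0 * g_ab)


# Univariate illustration: scalar estimates d and b, arbitrary scalings t1, t2,
# and a scalar fS.  Then g11 = (t1*d)^2/fS, g22 = (t2*b)^2/fS, g12 = t1*t2*d*b/fS.
d, b = 1.2, 3.0
t1, t2 = 2.0, 0.5
fs = 2.0
g11 = (t1 * d) ** 2 / fs
g22 = (t2 * b) ** 2 / fs
g12 = t1 * t2 * d * b / fs

m1 = explicit_estimator(g11, g22, g12, t1, t2)
print(m1, d / b)
```

The reduction works because in the scalar case the square root equals $g_{11} + g_{22}$, so the bracket is $2g_{11}$ and the scalings cancel.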
6. ESTIMATION BY THE METHOD OF MAXIMUM
LIKELIHOOD (UNIVARIATE)

The parameter vector $\Psi$ can be rewritten as follows:

$\Psi = \Gamma U$,

where

$\Gamma = [\, \alpha \;\; \beta_1 \;\; \beta_2 \,]$

and

$U = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & \mu & 1 & 0 & 0 \\ 0 & 0 & 0 & \mu & 1 \end{bmatrix}$.

Therefore the likelihood function is

$L = (2\pi\sigma^2)^{-N/2} \exp\{ -(1/2\sigma^2) [V + (\hat{\Psi} - \Gamma U) A^{-1} (\hat{\Psi} - \Gamma U)'] \}$,

where $N = 2n$ and $A = (XX')^{-1}$. If we let

$Q' = (A^{-1} B' \hat{\Gamma}')' = [\, q_1 \;\; q_2 \;\; q_3 \;\; q_4 \;\; q_5 \,]$,

where $B$ is the derivative of $U$ with respect to $\mu$ and is explicitly

$B = \begin{bmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix}$,

then the maximum likelihood estimators of $\Gamma$ and $\mu$ are

$\hat{\Gamma} = \hat{\Psi} A^{-1} U' (U A^{-1} U')^{-1}$

and

$m = [\hat{\Psi} Q - (\hat{\alpha} q_1 + \hat{\beta}_1 q_3 + \hat{\beta}_2 q_5)] / (\hat{\beta}_1 q_2 + \hat{\beta}_2 q_4)$,

respectively. These solutions are iterative, requiring an initial estimate of
$\mu$, which could be $m_1 = d_1/b_1$, given in (3.1).
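The alternating scheme described here (estimate $\Gamma$ for the current $\mu$, then update $\mu$) can be sketched in a few lines. This is an illustration under our reading of the formulas, namely $\hat{\Gamma} = \hat{\Psi}A^{-1}U'(UA^{-1}U')^{-1}$ and $m = [\hat{\Psi}Q - (\hat{\alpha}q_1 + \hat{\beta}_1 q_3 + \hat{\beta}_2 q_5)]/(\hat{\beta}_1 q_2 + \hat{\beta}_2 q_4)$; the helper functions and the noise-free test design are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(a)
    m = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        pivot = m[c][c]
        m[c] = [v / pivot for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


B = [[0, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 1, 0]]   # dU/dmu


def u_matrix(mu):
    return [[1, 0, 0, 0, 0], [0, mu, 1, 0, 0], [0, 0, 0, mu, 1]]


def ml_potency(psi_hat, a_inv, mu0, tol=1e-12, max_iter=500):
    """Alternate between Gamma-hat (for fixed mu) and the update for mu."""
    mu = mu0
    for _ in range(max_iter):
        u = u_matrix(mu)
        ua = matmul(u, a_inv)                                  # U A^{-1}
        gamma = matmul(matmul([psi_hat], transpose(ua)),
                       inverse(matmul(ua, transpose(u))))[0]   # [alpha, beta1, beta2]
        q = matmul(matmul([gamma], B), a_inv)[0]               # Q' = Gamma B A^{-1}
        num = sum(p * qi for p, qi in zip(psi_hat, q)) \
            - (gamma[0] * q[0] + gamma[1] * q[2] + gamma[2] * q[4])
        mu_new = num / (gamma[1] * q[1] + gamma[2] * q[3])
        if abs(mu_new - mu) < tol:
            return mu_new, gamma
        mu = mu_new
    return mu, gamma


# Noise-free check on a symmetric design (hypothetical values).
x = [-1.5, -0.5, 0.5, 1.5]
X = [[0.5] * 8,
     [-0.5] * 4 + [0.5] * 4,
     x + x,
     [-v for v in x] + x,
     [v * v for v in x] * 2]
a_inv = matmul(X, transpose(X))                 # A^{-1} = XX'
gamma_true, mu_true = [10.0, 2.0, -0.5], 0.3
psi_hat = matmul([gamma_true], u_matrix(mu_true))[0]
mu_hat, gamma_hat = ml_potency(psi_hat, a_inv, mu0=psi_hat[1] / psi_hat[2])
print(mu_hat, gamma_hat)
```

With exact data the starting value $d_1/b_1$ already equals the true shift, so the routine should return it unchanged together with the generating coefficients; with noisy data the number of iterations needed is not guaranteed by this sketch.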

7. LIKELIHOOD RATIO TEST FOR “PARALLELISM” 

The models being used here assume that the two parabolas are equivalent
except for a horizontal shift of one with respect to the other (i.e., a
dilution assay). Therefore the horizontal distance between the two parabolas
is constant. This will be the definition of “parallelism”. The assumption of
“parallelism” can be tested using a likelihood ratio criterion.

Under the alternative hypothesis of “non-parallel” parabolas (i.e.,
without the restrictions (2.3) and (2.4) imposed on the model (2.2)), the
maximum likelihood estimator for $\sigma^2$ is

$\hat{\sigma}^2 = V/N$.

Imposing the restrictions (2.3) and (2.4) on the model (2.2) (i.e., under the
null hypothesis of “parallel” parabolas), the maximum likelihood estimator
of $\sigma^2$ is

$\hat{\sigma}_0^2 = [V + (\hat{\Psi} - \hat{\Gamma}\hat{U}) A^{-1} (\hat{\Psi} - \hat{\Gamma}\hat{U})'] / N$.

Using these two estimators in the likelihood ratio statistic, the criterion
for rejecting the null hypothesis of “parallel” parabolas is to reject the null
hypothesis at a significance level of $\alpha$ if

$N \ln[1 + V^{-1} (\hat{\Psi} - \hat{\Gamma}\hat{U}) A^{-1} (\hat{\Psi} - \hat{\Gamma}\hat{U})'] > \chi^2_{1;\alpha}$.

8. REMARKS

The weighted mean method is computationally less complicated than the
maximum likelihood approach, but it is only applicable to symmetric assays.

The method of maximum likelihood is more general in that it can be
extended to apply to asymmetric assays, cubic response curves and cases
involving covariables. For a multivariate version of the maximum likelihood
method see Carter et al. (1986).

Williams (1973) and Armitage et al. (1974) have compared these two
methods in the univariate situation. At present, we are investigating their
behavior in a multivariate setting.

Computer software has been developed for this univariate and multivariate
parabolic situation. This extends the existing programs developed
for the quantal situation described by Carter and Hubert (1984) and the
parallel-line situation described by Carter and Hubert (1985).

REFERENCES 

Armitage, P., J. M. Bailey, A. Petrie, L. Annable, and M. P. Stack-Dunne (1974), 

“Studies in the combination of bioassay results”. Biometrics 30, 1-9. 

Bliss, C. I. (1957), “Bioassay from a parabola”. Biometrics 13, 35-50. 





M. N. WALSH, J. J. HUBERT AND E. M. CARTER 


Carter, E. M., and J. J. Hubert (1984), “A growth-curve model approach to
multivariate quantal bioassay”. Biometrics 40, 699-706.

Carter, E. M., and J. J. Hubert (1985), “Analysis of parallel-line assays with
multivariate responses”. Biometrics 41, 703-710.

Carter, E. M., J. J. Hubert, and M. N. Walsh (1986), “Analysis of multivariate
parabolic bioassays”. To be submitted.

Chou, R. J., and R. J. Muirhead (1979), “On some distributional problems in
MANOVA and discriminant analysis”. Journal of Multivariate Analysis 9,
410-419.

Elston, R. C. (1965), “A simple method of estimating relative potency from two
parabolas”. Biometrics 21, 140-149.

Finney, D. J. (1978), Statistical Method in Biological Assay, 3rd edition. London:
C. Griffin.

Hubert, J. J. (1984), Bioassay, 2nd edition. Dubuque, Iowa: Kendall-Hunt.

Kraft, C. H., I. Olkin, and C. van Eeden (1972), “Estimation and testing for
differences in magnitude or displacement in the mean vectors of two multivariate
normal populations”. Annals of Mathematical Statistics 43, 455-467.

Srivastava, M. S., and E. M. Carter (1983), An Introduction to Applied
Multivariate Statistics. New York: North-Holland.

Walsh, M. N. (1985), “Analysis of parabolic assays”. M.Sc. thesis, University of 
Guelph. 

Williams, D. A. (1973), “The estimation of relative potency from two parabolas in 
symmetric bioassays”. Biometrics 29, 695-700. 



M. M. Shoukri 1 and P. C. Consul 2 


SOME CHANCE MECHANISMS GENERATING THE 
GENERALIZED POISSON PROBABILITY MODELS 

1. INTRODUCTION

A new generalization of the discrete Poisson distribution was introduced
by Consul and Jain (1970, 1973) and studied by Charalambides (1974), Kumar
and Consul (1979), Janardan and Schaeffer (1977), Janardan et al.
(1979), Consul and Shoukri (1984, 1985), and many others. This model
contains two parameters and possesses the over-dispersion and the
under-dispersion properties which make it a good descriptive model for
aggregation patterns in biology, ecology and many other disciplines. Although
the negative binomial distribution possesses the over-dispersion property, it
is found that the parametric structure of the generalized Poisson distribution
(GPD) permits a relatively simpler inferential procedure. The GPD model is
given by

$P(X = x) = \theta(\theta + \lambda x)^{x-1} e^{-\theta - \lambda x} / x!$,   x = 0, 1, 2, ...   (1.1)

and zero otherwise, where $\theta > 0$, $|\lambda| < 1$. When $\lambda$ is negative
and $m$ is the smallest positive integer for which $\theta + m\lambda < 0$, we set
$P(X = x) = 0$ for all $x \ge m$.

It was shown by Consul and Shoukri (1985) that for $0 < \theta < 1.5$, the
maximum error due to the above truncation is less than 0.5% when the
number of non-zero class probabilities is limited to three. Further error
analysis was made over various subregions of the parameter space of $\theta$ and
$\lambda < 0$. As a general remark, it may be noted that the error due to
truncation gets smaller when the number of classes is greater than four. The
mean and variance exist for $\lambda < 1$ and are given respectively by

$\mu = \theta(1 - \lambda)^{-1}$,   $\sigma^2 = \theta(1 - \lambda)^{-3}$.
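The probability function (1.1) and the two moment formulas are easy to check numerically for non-negative $\lambda$, where no truncation is needed. The sketch below is illustrative: the parameter values, the truncation of the sums at 400 terms, and the function name are our own choices.

```python
import math


def gpd_pmf(x, theta, lam):
    """P(X = x) from (1.1), for theta > 0 and 0 <= lam < 1 (no truncation)."""
    log_p = (math.log(theta) + (x - 1) * math.log(theta + lam * x)
             - theta - lam * x - math.lgamma(x + 1))
    return math.exp(log_p)


theta, lam = 2.0, 0.3
probs = [gpd_pmf(x, theta, lam) for x in range(400)]   # tail beyond 400 is negligible

total = sum(probs)
mean = sum(x * p for x, p in enumerate(probs))
var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))

print(total, mean, theta / (1 - lam))
print(var, theta / (1 - lam) ** 3)
```

Working on the log scale avoids overflow in $(\theta + \lambda x)^{x-1}$ and $x!$ for large $x$. Setting `lam = 0` recovers the ordinary Poisson distribution, with mean and variance both equal to $\theta$.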
Ontario N9B 3P4

Janardan and Schaeffer (1977), and Janardan et al. (1979) fitted the
GPD to about 100 different sets of biological and ecological data and have
found that as a descriptive model, the GPD has provided an excellent fit
in every case. Therefore, it is natural that one should wonder about the
underlying chance mechanisms that generate this model. In this paper we
show that there are many such mechanisms.


2. THE GPD AS A LIMIT OF THE GENERALIZED
NEGATIVE BINOMIAL MODEL

The generalized negative binomial distribution (GNBD) was defined and
studied by Jain and Consul (1971) as a discrete probability model. The
definition was subsequently modified by Consul and Gupta (1980) and the
discrete probability model was given as

$P(X = x) = \frac{n}{n + \beta x} \binom{n + \beta x}{x} \alpha^x (1 - \alpha)^{n + \beta x - x}$,   x = 0, 1, 2, ...,   (2.1)

and zero otherwise, where $0 < \alpha < 1$ and $\beta = 0$ or $1 < \beta < \alpha^{-1}$.

The distribution was introduced as a queuing model by Mohanty (1966)
and Takacs (1962). Recently, Hill and Gulati (1981) showed that the
probabilities given by (2.1) are those of a gambler’s ruin in a random walk
model associated with the game of roulette.

When $n$ and $\beta$ are very large and $\alpha$ is very small such that
$n\alpha = \theta$ and $\beta\alpha = \lambda$, where $\theta$ and $\lambda$ are finite
constants, one can show, by employing Stirling’s approximation formula on the
combinatorial term in (2.1), that the GPD model described in (1.1) is the
limiting form of the above GNBD.


3. THE GPD AS A FAMILY OF LAGRANGIAN
PROBABILITY DISTRIBUTIONS

Let $g(t)$ and $f(t)$ be two functions of $t$ which are successively
differentiable and are such that $g(1) = f(1) = 1$, $g(0) \ne 0$, and
$0 < f(0) < 1$. Under the transformation

$u = t/g(t)$   (3.1)

and for some value of $t$ within the circle of convergence of $f(t)$, the power
series expansion of $f(t)$, in powers of $u$, is given by Lagrange’s formula
(Whittaker and Watson, 1963) as

$f(t) = \sum_{x \in T} \frac{u^x}{x!} D^{x-1}[(g(t))^x Df(t)]\,\big|_{t=0}$,   (3.2)

where $T$ is a subset of the set of non-negative integers and $D = d/dt$.

Taking $D^{x-1}[(g(t))^x Df(t)]\,|_{t=0} > 0$ for all $x \in T$, Consul and Shenton
(1972) and Consul (1981) defined a random variable $X$ with a probability
distribution of the form

$P(X = x) = \frac{1}{x!} D^{x-1}[(g(t))^x Df(t)]\,\big|_{t=0}$ for $x \in T$, and 0 otherwise.   (3.3)

Consul and Shenton (1973) discussed some interesting properties of this class
of distributions, which they called “Lagrangian Probability Distributions”
(LPD).

Taking $f(t) = \exp[\theta(t - 1)]$ and $g(t) = \exp[\lambda(t - 1)]$ in (3.3), and
on simplifying the LPD, it reduces to the GPD given by (1.1). Also, the
probability generating function (pgf) of the GPD becomes

$H(u) = e^{\theta(t - 1)}$,   (3.4)

where $t e^{-\lambda(t - 1)} = u$. It may be noted that in this pgf of the GPD both
the parameters $\theta$ and $\lambda$ are positive.
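The Lagrangian representation can be checked numerically: the power series $\sum_x P(X = x)u^x$ built from (1.1) should agree with $f(t) = \exp[\theta(t - 1)]$ evaluated at the root $t$ of $u = t/g(t)$, i.e. $t = u\exp[\lambda(t - 1)]$. The sketch below assumes that reading of (3.4), solves for $t$ by fixed-point iteration, and uses illustrative parameter values.

```python
import math


def gpd_pmf(x, theta, lam):
    """P(X = x) from (1.1), computed on the log scale (theta > 0, 0 <= lam < 1)."""
    log_p = (math.log(theta) + (x - 1) * math.log(theta + lam * x)
             - theta - lam * x - math.lgamma(x + 1))
    return math.exp(log_p)


def lagrange_pgf(u, theta, lam, iters=200):
    """exp(theta*(t-1)) with t solving t = u*exp(lam*(t-1)), for 0 <= u <= 1."""
    t = u
    for _ in range(iters):
        # Fixed-point step; a contraction here because u*lam < 1.
        t = u * math.exp(lam * (t - 1.0))
    return math.exp(theta * (t - 1.0))


theta, lam, u = 1.5, 0.4, 0.7
series = sum(gpd_pmf(x, theta, lam) * u ** x for x in range(300))
closed = lagrange_pgf(u, theta, lam)
print(series, closed)
```

Agreement of the two numbers is a check on the algebra only for the chosen parameter values; it is not a proof of the expansion.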


4. THE GPD MODEL AS THE DISTRIBUTION OF THE
TOTAL PROGENY IN A GALTON-WATSON
BRANCHING PROCESS

Historically, the study of branching processes began with the
Galton-Watson process formulated by Galton to describe a possible mechanism of
extinction of distinguished English families. The model of Galton and Watson
was used by R. A. Fisher (1930) to study the survival of the progeny
of a mutant gene. In ecological and biological applications, we consider a
species of animal, insect, bacteria or gene that reproduces asexually. Each
individual, before dying, reproduces a certain number of new individuals. If
the reproduction process starts with a single individual, we are interested
in (a) determining whether or not the species eventually dies out, (b) the
distribution of the total progeny at the moment of extinction. While both
questions are frequently answered in many standard textbooks on stochastic
processes, explicit expression for the distribution of the total progeny
was lacking, especially when the reproduction process starts with a random
number of ancestors. To illustrate, let us consider a problem which is of
prime interest to government health officials.

In many countries, tourism constitutes a fundamental source of income. 
In a particular country the tourists arrive at different destinations by sea, air 
and road through many checkpoints. Although international health cards 
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show vaccination has been completed, there is always a chance that a tourist 
may carry the bacteria of an infectious disease which spreads to other in¬ 
dividuals upon contact. Since the number of tourists arriving is very large 
and the probability of a tourist carrying the bacteria is very small, it is rea¬ 
sonable to assume that the number of infected tourists is a Poisson random 
variable Xq with pgf 

f(t) = E [t x °] = exp[*(t - 1)]. 

Here $X_0$ represents the 0th generation of a branching process or the
generation of ancestors. For the validity of the model, we assume Reed and
Frost's hypothesis on the theory of epidemics to be satisfied (Neyman,
1963). Their hypothesis requires that every infected individual undergoes
a fixed incubation period, becomes infectious for a short period of time,
and then upon contact with a large number of people (drivers, guides, sales
people, other uninfected tourists, hotel and restaurant employees,
moviegoers, etc.) may infect a random number of susceptibles. We shall assume
that the number of persons $Z$ who get infected among the susceptibles
is also a Poisson random variable, with pgf
$g(t) = E(t^Z) = \exp[\lambda(t - 1)]$. The newly infected persons will then form
the first generation, and each one of them will generate offspring of infected
individuals under the same conditions as stated above, and so on. The main
objective is to find the probability distribution of the infected individuals up
to the moment when the epidemic dies out.
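
The epidemic described above can be imitated by a small simulation. The following is only a sketch under stated assumptions: the parameter values (theta = 2, lambda = 0.5), the function names and the simple Poisson sampler are all chosen here for illustration and do not come from the paper.

```python
import math
import random


def poisson(rate, rng):
    """Draw one Poisson variate by Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def total_infected(theta, lam, rng):
    """Total number infected over all generations, ancestors included."""
    current = poisson(theta, rng)          # generation 0: infected tourists
    total = current
    while current > 0:                     # each infected person infects Poisson(lam) others
        current = sum(poisson(lam, rng) for _ in range(current))
        total += current
    return total


rng = random.Random(1985)                  # fixed seed so the run is repeatable
theta, lam = 2.0, 0.5                      # illustrative values with lam < 1
sizes = [total_infected(theta, lam, rng) for _ in range(20000)]
mean_size = sum(sizes) / len(sizes)
expected_mean = theta / (1.0 - lam)        # theta ancestors, each with mean total progeny 1/(1 - lam)
```

With these illustrative values the simulated mean epidemic size should lie close to theta/(1 - lambda) = 4, because each ancestor together with its descendants has mean size 1/(1 - lambda).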

Let $X_0, X_1, \ldots, X_n, \ldots$ denote the total number of infected
individuals in the 0th, 1st, ..., nth, ... generations, i.e., they denote the sizes of
the infected population at a sequence of points in time. One of the basic
assumptions here is that the probability distribution of the number of persons
infected by each infectious individual is the same in every generation.
Although this assumption may be valid at
the beginning of the epidemic, the situation may change if an anti-epidemic
immunization campaign is launched. However, we shall assume the validity
of the assumption throughout. Let $g_n(s) = E(s^{X_n})$.

If we assume that $X_0 = 1$ with probability one, so that the process starts
with a single infected tourist, then $g_0(s) = s$, and $g_1(s) = g(s)$. Thus, for
$n = 2, 3, \ldots$,

$$
\begin{aligned}
g_{n+1}(s) &= \sum_{k=0}^{\infty} P[X_{n+1} = k]\, s^k \\
&= \sum_{k=0}^{\infty} \sum_{j=0}^{\infty} P[X_{n+1} = k \mid X_n = j]\, P[X_n = j]\, s^k \\
&= \sum_{j=0}^{\infty} P[X_n = j]\, (g(s))^j = g_n(g(s)).
\end{aligned}
$$

Also

$$g_2(s) = g_1(g(s)) = g(g(s)) = g(g_1(s))$$

and similarly

$$g_{n+1}(s) = g(g_n(s)).$$

The above branching process will stop as soon as $P[X_n = 0] = 1$ for some
positive integer n. As is well known, the necessary and sufficient condition
for this happening is that $g'(1) = \lambda < 1$.
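
The role of the recursion in the extinction condition can be checked numerically. The sketch below assumes the Poisson offspring pgf used in this section and iterates the recursion at s = 0, so that the n-th value is the probability that generation n is empty. The function name and the two illustrative values of lambda are not from the paper.

```python
import math


def extinction_by_generation(lam, generations):
    """Return [P(X_1 = 0), ..., P(X_n = 0)] for one ancestor and Poisson(lam) offspring."""
    def g(t):
        return math.exp(lam * (t - 1.0))   # offspring pgf

    probabilities, q = [], 0.0             # g_0(0) = 0 because X_0 = 1
    for _ in range(generations):
        q = g(q)                           # g_{n+1}(0) = g(g_n(0))
        probabilities.append(q)
    return probabilities


subcritical = extinction_by_generation(0.5, 200)     # lam < 1
supercritical = extinction_by_generation(1.5, 200)   # lam > 1
```

For lambda = 0.5 the sequence climbs to one, whereas for lambda = 1.5 it settles at the smaller root of q = g(q), near 0.417, so that the epidemic survives with positive probability.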

Now let us assume that the process of spreading the infection stops after
the nth generation so that $Y_n = X_0 + X_1 + \cdots + X_n$ (where $X_0 = 1$). Also,
let $G_n(s)$ be the pgf of $Z_n = X_1 + \cdots + X_n$. Thus

$$G_1(s) = g_1(s) = g(s),$$

and the pgf of $Y_1$ is

$$E\left[s^{Y_1}\right] = s\,g_1(s) = R_1(s) \quad \text{(say)}.$$

Since each of the $X_1$ infected individuals will start a new generation, the pgf
of $Z_2$ becomes

$$G_2(s) = g[s\,G_1(s)] = g[R_1(s)],$$

and similarly

$$G_n(s) = g[R_{n-1}(s)], \qquad n = 2, 3, 4, \ldots.$$

Thus, the pgf of $Y_n = 1 + Z_n$ is

$$R_n(s) = s\,G_n(s) = s\,g[R_{n-1}(s)].$$

Taking the limit, as n increases without limit,

$$G(s) = \lim_{n \to \infty} R_n(s) = s\,g\left[\lim_{n \to \infty} R_{n-1}(s)\right]$$

and

$$G(s) = s\,g(G(s)). \tag{4.1}$$

Putting $G(s) = t$ in (4.1), we get $t = s\,g(t)$.

Thus, the Lagrange expansion of t as a function of s may then be
obtained using (3.3) for $f(t) = t$, to give the probability distribution of the
total number of infected individuals starting with one infectious individual,
as

$$P[Y = j \mid X_0 = 1] = \frac{1}{j!}\,(j\lambda)^{j-1} e^{-j\lambda}, \qquad j = 1, 2, \ldots, \tag{4.2}$$

which is known as the Borel distribution. If $X_0 = M$ with probability one,
then $f(t) = t^M$, and

$$P[Y = j \mid X_0 = M] = \frac{M}{(j - M)!}\, j^{\,j-M-1} \lambda^{j-M} e^{-j\lambda}, \qquad j = M, M+1, \ldots, \tag{4.3}$$

which is known as the Borel-Tanner distribution.
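
As a numerical check on the two distributions just named, the sketch below evaluates the standard Borel-Tanner probabilities on the log scale and sums them. The function name, the value lambda = 0.4 and the truncation point 400 are assumptions made for illustration.

```python
import math


def borel_tanner_pmf(j, m, lam):
    """P[Y = j | X_0 = m] for Poisson(lam) offspring; m = 1 gives the Borel distribution."""
    if j < m:
        return 0.0
    log_p = (math.log(m) + (j - m - 1) * math.log(j) + (j - m) * math.log(lam)
             - j * lam - math.lgamma(j - m + 1))
    return math.exp(log_p)


lam = 0.4                                   # illustrative value with lam < 1
borel_total = sum(borel_tanner_pmf(j, 1, lam) for j in range(1, 400))
borel_mean = sum(j * borel_tanner_pmf(j, 1, lam) for j in range(1, 400))
tanner_total = sum(borel_tanner_pmf(j, 3, lam) for j in range(3, 400))
```

The probabilities sum to one for lambda < 1, and the Borel mean agrees with 1/(1 - lambda), the expected total size of an epidemic started by one individual.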

In the case when $X_0$ is a random variable with the pgf $f(t) = \exp[\theta(t - 1)]$,
as assumed and shown by us earlier, the probability distribution of the total
number of infected individuals is obtained either by direct substitution for
$f(t)$ and $g(t)$ in (3.3), or by multiplying (4.3) by the Poisson probability

$$P[X_0 = M] = e^{-\theta}\theta^M / M!$$

and summing over all values of M. In either case the probability distribution
of the total number of infected individuals becomes

$$P(Y = j) = \theta(\theta + \lambda j)^{j-1} (j!)^{-1} e^{-\theta - j\lambda}, \qquad j = 0, 1, 2, \ldots, \tag{4.4}$$

which is the GPD model given in (1.1) with the parametric values $\theta > 0$ and
$0 < \lambda < 1$.
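
The second route described above, mixing the Borel-Tanner probabilities over a Poisson number of ancestors, can be verified numerically against the closed form. This is a sketch only: the function names and the values theta = 2, lambda = 0.5 are illustrative assumptions.

```python
import math


def gpd_pmf(j, theta, lam):
    """Generalized Poisson probability theta (theta + lam j)^(j-1) exp(-theta - lam j) / j!."""
    log_p = (math.log(theta) + (j - 1) * math.log(theta + lam * j)
             - theta - lam * j - math.lgamma(j + 1))
    return math.exp(log_p)


def borel_tanner_pmf(j, m, lam):
    """P[Y = j | X_0 = m] for Poisson(lam) offspring."""
    if j < m:
        return 0.0
    log_p = (math.log(m) + (j - m - 1) * math.log(j) + (j - m) * math.log(lam)
             - j * lam - math.lgamma(j - m + 1))
    return math.exp(log_p)


def mixed_pmf(j, theta, lam):
    """Sum over M of the Poisson(theta) weight times the Borel-Tanner probability."""
    if j == 0:
        return math.exp(-theta)             # no ancestors, hence no epidemic
    return sum(
        math.exp(-theta + m * math.log(theta) - math.lgamma(m + 1))
        * borel_tanner_pmf(j, m, lam)
        for m in range(1, j + 1)
    )


theta, lam = 2.0, 0.5                       # illustrative values
largest_gap = max(abs(gpd_pmf(j, theta, lam) - mixed_pmf(j, theta, lam)) for j in range(25))
gpd_total = sum(gpd_pmf(j, theta, lam) for j in range(600))
gpd_mean = sum(j * gpd_pmf(j, theta, lam) for j in range(600))
```

The two computations agree to rounding error, and the mean of the resulting distribution is theta/(1 - lambda).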

Some Remarks: 

It may also be noted that there are two other possible choices of $g(t)$, the
pgf of the probability distribution of offspring, which provide other forms of
the GPD model. When

$$g(t) = \exp[\alpha\theta(t - 1)],$$

the GPD model (4.4) becomes

$$P(X = x) = (1 + \alpha x)^{x-1}\theta^x e^{-\theta(1 + \alpha x)}/x!, \qquad x = 0, 1, 2, \ldots, \tag{4.5}$$

and when

$$g(t) = \exp[\theta(t - 1)],$$

the GPD model (4.4) becomes

$$P(X = x) = (1 + x)^{x-1}\theta^x e^{-\theta(1 + x)}/x!, \qquad x = 0, 1, 2, \ldots. \tag{4.6}$$

The model given by (1.1) or (4.4) reflects the fact that the generation of 
ancestors is quite different from that of the immediate offspring, which seems 
quite reasonable in our example, especially when the 0th generation of the
epidemic consists of alien tourists. The probability of regeneration among
the local population may be different because of a change in conditions, such 
as the resistance capacity of the individuals. 

Model (4.5) reflects partial similarity between ancestors and progeny 
and this may be explained by factors such as common inheritance. Model 
(4.6) will be applicable when the distributions of the ancestors and of their 
offspring are identical, as in the case of hybrid populations of certain species. 


5. APPLICATIONS OF THE GPD MODEL IN INDUSTRY, 
PSYCHOLOGY, SOCIOLOGY AND BIOLOGY 

By considering the problem of the spread of an epidemic, it has been 
shown in the last section that, under certain conditions, the GPD model 
can be based upon branching processes. The general hypothesis is that a 
particular characteristic is transmitted to some members of a group from a 
source either at an initial point in time or continuously. Then the individuals 
who have acquired the characteristic spread it to members of other groups. 
The new generation spreads the same characteristic again and the process 
continues again and again until either it dies out or else the whole population 
gets the particular characteristic. Accordingly, the GPD model can be used 
in all situations where the following conditions can be justified: 

(i) The total number of units or individuals in a group is large. 

(ii) The probability of acquiring the characteristic by a unit or individual is 
small. 

(iii) Each one of the units (or individuals) having the particular characteristic 
becomes a spreader of the characteristic for a short period of time. 

(iv) The number of members in the group to whom each spreader (infectious
individual) having the characteristic is likely to pass it on (infect) is
also large.

In each case, the number of individuals having the characteristic (i.e., 
the total progeny in the branching process) will be modelled by the GPD. 

Such conditions may exist in numerous aspects of human society and 
among other life forms. We shall consider a few important applications. 

(A) Spread of ideas and rumours. One individual transmits an idea or
rumour, either by word of mouth or by some demonstration. The idea or
rumour is picked up by a small number of individuals. Each one of them,
including the originator, becomes a transmitter like the first one and the
process continues. Ultimately, the idea or rumour either dies down, i.e.,
becomes extinct, or spreads among everybody.

(B) Spread of fashion and sales. The effectiveness of an advertising campaign
is measured by the total number of adoptions of the advertised product
or fashion. For example, on viewing a television commercial advertising
a new product, a random number of customers are generated whose
pgf is $f(t) = \exp[\theta(t - 1)]$. When other people see these customers buying
the product or using it, some of them become customers. Thus, each
original customer generates a further random number of customers, and
the branching process continues until the process becomes extinct. Thus,
each display of the commercial initiates a regenerative branching process
phenomenon for the adoption of the product.

(C) Distributions of salespersons in "pyramid" dealerships, such as AVON,
Amway, Tupperware and cleaning chemicals. Each distributor
(infectious individual) initially collects a group of persons (susceptibles) and
gives a demonstration or a show and encourages them to become
salespersons. Some of them enrol as salespersons (get infected). Then they
repeat the same kind of demonstration or show and try to enrol others
(infect them). The distributorships mushroom for a while and then die
down.

(D) Population counts of particular species of animals, such as insects, fish
and frogs, in different areas. The conditions for the regeneration of all
of these are very similar to those stated in (i) to (iv). The number of
eggs laid by each individual is very large and the probability of an egg
becoming an adult is very small, since a high proportion of eggs and
immature offspring are eaten by other animals or get killed by other
means. Again, the process of regeneration is a branching process.

(E) Distribution of burnt trees in forest fires. A single spark starts a fire
which burns a tree. As the tree burns, numerous sparks are produced
which fly out around that tree. Most of these sparks die out, but some
of them develop into independent fires, burning other trees. Then the
process continues as a regenerative process. Thus, all the conditions (i)
to (iv) of the process exist which will provide the GPD model.


USES, LIMITATIONS, AND REQUIREMENTS OF
MULTIVARIATE ANALYSES
FOR INTERCROPPING EXPERIMENTS

ABSTRACT 

Since summarization of data from intercropping experiments involves
a linear combination of crop yields, it would appear at first glance that
multivariate analyses would be ideally suited for this situation. Linear
combinations of values of crops, of total calories of crops, of total protein
contents of crops, and of land utilization of crops are among the linear
combinations which have considerable utility in summarizing data from
intercropping experiments. From multivariate analyses, canonical variables
based upon the criterion of maximum discriminating ability can be obtained
when all crops are present in a mixture of crops. For v crops taken k at a
time, such canonical variables are not obtainable using presently available
theory. Even if they were, it is not certain that they would have any general
utility for interpretational purposes. Areas of further research in
multivariate analysis are discussed. These results are necessary in order to utilize
multivariate analyses for intercropping investigations.

1. INTRODUCTION 

Intercropping investigations involve the growing of two or more crops 
simultaneously or sequentially on the same plot of ground. These mixtures 
of crops have been found to be beneficial in several respects when compared 
to yield responses from crops grown alone, i.e., sole crops. One problem 
is how to combine the yield responses from several crops, which may have 
considerably different means and variances. For a farmer who considers crop
values in monetary terms, one could put a monetary value on each crop and
obtain a total monetary value per hectare or per acre for each of the mixtures
and for the sole crops. The form would be:

$$v_1 C_1 + v_2 C_2 + v_3 C_3 + \cdots + v_k C_k, \tag{1}$$

where $v_i$ is the value of the ith crop, $C_i$ is the yield of the ith crop in
kilograms/hectare, say, and $i = 1, 2, \ldots, k$, where k is the number of crops
in the mixture. Such a linear combination as (1) would provide a useful
statistic for a grower of crops. Likewise, if the farmer were interested in the
total production of calories for his family needs, again we could use a linear
combination of the form:

$$c_1 C_1 + c_2 C_2 + c_3 C_3 + \cdots + c_k C_k, \tag{2}$$

where $c_i$ is the calorie conversion coefficient for the ith crop. A similar linear
combination could be used for protein conversion.

If the farmer were interested in land use efficiency, he could use a linear
combination of yield responses for each crop of the form:

$$C_1/C_{1s} + C_2/C_{2s} + C_3/C_{3s} + \cdots + C_k/C_{ks}, \tag{3}$$

where $C_{is}$ is the yield of the ith crop when grown as a sole crop. The
coefficients $v_i$ and $c_i$ would normally be taken as constants with the $C_i$ being
the only random variables. Then if $C_1, \ldots, C_k$ have a k-variate multivariate
normal distribution, (1) and (2) each have a univariate normal distribution
and no problems of statistical analyses are encountered over those found
for sole cropping. If the $C_{is}$ are also random variables, (3) would give a
distribution which is a sum of Cauchy variables. However, if the $C_{is}$ are
constants, which might be the average yields of crops in that region for the
past y years, then (3) also has a univariate normal distribution.

Now crop values and yields fluctuate from year to year but the ratios of
prices, $v_i/v_1$, and of yields, $C_{is}/C_{1s}$, where $v_1$ is the value for a base crop
and $C_{1s}$ is the yield for a base crop, are relatively stable. Hence one could
use $\sum v_i C_i / v_1$ in place of (1), $\sum c_i C_i / c_1$ in place of (2), and $C_{1s} \sum C_i / C_{is}$
in place of (3). Let us call these "relative crop values", "relative calories",
and "relative land use". In this form then, all linear combinations have a
similar form,

$$C_1 + b_2 C_2 + b_3 C_3 + \cdots + b_k C_k. \tag{4}$$

The above is also the form for a canonical variable obtained from a
multivariate analysis. As we shall see, this form is useful when comparing the various
forms with each other and especially with the canonical variable obtained
from a multivariate analysis.

We first consider a multivariate analysis for p characteristics on crop i.
There are no difficulties encountered here which are due to intercropping. In
Section 3 the case of $c_1$ lines of crop one and $c_2$ lines of crop two in all possible
combinations is discussed. The yields of crop one are considered to be the
first variable $X_1$ and the yields of the second crop are considered to be the
second variable $X_2$. A series of interpretational problems are encountered
for this situation. The present state of multivariate theory does not allow
for solution of these problems and the basic utility of discriminant function
analysis of variance (MANOVA), which is to obtain a canonical variable
which has maximum discriminating ability, is questioned. In Section 4, the
case of p crops in mixtures of k crops, $k = 1, 2, \ldots, p$, is considered. Even if
present multivariate theory were applicable in Section 3, it is not sufficient to
handle the problems encountered in Section 4 even if k is not allowed to vary.
Section 5 considers the situation for which there are $c_i$ lines for the ith crop
and these are mixtures of k of p crops. Even if each of the line combinations
from the $\prod_{i=1}^{k} c_i = v$ = total number of line combinations are considered
separately, we are left with the difficulties encountered in Sections 3 and 4.
Finally, some areas of research to extend present multivariate theory and
analyses are discussed in the last section.
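
The relative forms introduced in this section all reduce to a weighted sum of yields in which the base crop has weight one. A minimal sketch follows; every number in it (yields, prices, sole crop yields) is hypothetical, as are the function names.

```python
def relative_combination(yields, coefficients):
    """Return C_1 + b_2 C_2 + ... + b_k C_k with b_i = coefficients[i] / coefficients[0]."""
    base = coefficients[0]
    return sum(c / base * y for c, y in zip(coefficients, yields))


def relative_land_use(yields, sole_yields):
    """Return C_1s * sum(C_i / C_is), the relative form of the land use combination."""
    return sole_yields[0] * sum(y / s for y, s in zip(yields, sole_yields))


# Hypothetical maize-bean mixture: yields in kg/ha and prices per kg.
mixture_yields = [3000.0, 400.0]
prices = [0.20, 1.00]
sole_yields = [4000.0, 1000.0]

relative_value = relative_combination(mixture_yields, prices)   # 3000 + 5 * 400
land_use = relative_land_use(mixture_yields, sole_yields)       # 4000 * (0.75 + 0.40)
```

In this hypothetical case the bean coefficient is 5 on the price scale and 4 on the land use scale, which illustrates how the same mixture can be summarized by different but comparable values of b.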


2. MULTIVARIATE ANALYSIS FOR p CHARACTERISTICS 
OF ONE CROP 

When there are p characteristics or variables for a single crop, and
provided that the necessary normality assumptions are satisfied, standard
multivariate procedures may be utilized. The purpose here would be to obtain
linear combinations (canonical variables) of the p variables which
discriminate among the treatments in the experiment. It would be hoped that one
or two canonical variables would summarize the whole of the information for
the p variables. Also, the relative importance of the p variables in
discriminating among the treatments could be assessed.

Thus, when using multivariate procedures for the purpose described here,
there appear to be no problems in summarizing information from data
derived from intercropping investigations. For example, suppose that an
intercropping experiment on maize-bean mixtures in combination and with maize
and beans grown alone (sole crops) was conducted. Suppose that there were
two lines of maize, X and Y, and four lines of beans, A, B, C, and D; then
the 14 treatments in the experiment are given in Table 1.

Table 1. Treatment Design for
a Maize-Bean Experiment

                        Beans
Maize       A      B      C      D      Sole

X           x      x      x      x      x
Y           x      x      x      x      x
Sole        x      x      x      x

There are eight mixtures plus two maize sole crops plus four bean sole
crops. If $M_1$ is the total weight of maize grain for an experimental unit and
$M_2$ is the number of ears per maize plant, there would be two characteristics
for maize and a bivariate analysis could be performed for the ten treatments
involving maize, i.e., the eight mixtures and the two sole crops for maize.
Likewise, for the four variables measured on beans ($B_1$ is the number of
beans per pod, $B_2$ is the number of pods per plant, $B_3$ is the grain weight
of 100 beans, and $B_4$ is the grain weight per experimental unit), one could
perform a multivariate analysis using the 12 treatments involving beans. All
four variables or some subset of the four variables could be included in the
multivariate analysis.
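
The treatment counts quoted for the maize-bean example can be reproduced by listing the treatments. The sketch below uses the line labels from the text; the variable names and the use of `None` for an absent crop are illustrative choices.

```python
maize_lines = ["X", "Y"]
bean_lines = ["A", "B", "C", "D"]

# A treatment is a (maize line, bean line) pair; None marks a crop that is absent.
mixtures = [(m, b) for m in maize_lines for b in bean_lines]
maize_sole = [(m, None) for m in maize_lines]
bean_sole = [(None, b) for b in bean_lines]
treatments = mixtures + maize_sole + bean_sole

# Treatments that enter the maize analysis and the bean analysis, respectively.
with_maize = [t for t in treatments if t[0] is not None]
with_beans = [t for t in treatments if t[1] is not None]
```

This gives 14 treatments in all, 10 of which carry maize responses and 12 of which carry bean responses.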


3. MULTIVARIATE ANALYSIS FOR MIXTURES OF $c_1$ LINES
OF CROP ONE WITH $c_2$ LINES OF CROP TWO

As was stated in the introduction, a goal of statistical analyses for
intercropping investigations is to obtain some linear combination of the yields
from both crops in a mixture. If we let crop one yields be variable one,
say $X_1$, and crop two yields be variable two, say $X_2$, a bivariate analysis
of variance may be performed (see Pearce and Gilliver, 1978, 1979; Mead
and Riley, 1981). As mentioned by the above authors, one serious problem
here is heterogeneity of covariance matrices. The variance and covariance
of a line of one crop may vary from line to line of the second crop. Thus,
the lines of any one crop may, and often do, have heterogeneous covariance
structures. To overcome this, it is suggested that single degree of freedom
contrasts be made from the treatment sum of squares. (Also see Singh and
Gilliver, 1983.) It may be necessary to compute an individual error variance
for each single degree of freedom contrast. This can easily be accomplished
for completely randomized and randomized complete blocks designs.

One problem that appears to be unsolvable using presently available
multivariate theory is how to compare sole crop yields for maize, say $M_s$,
with a linear combination of maize and beans in a mixture, say $M_m + bB_m$.
If b is a ratio of crop values, for example, this can easily be done. If b is a
ratio of sole crop yields in farmers' fields, say $b = M_s/B_s$, then again
one can compare sole crop and mixture yields. However, if b is a coefficient
obtained from a bivariate analysis to form a canonical variable, discriminant
analysis is not sufficiently advanced to allow comparisons between $M_s$ and
$M_m + bB_m$. This is a serious deficiency in the theory as it is usually desired
to compare sole cropping with mixture yields in intercropping investigations.

A second problem arising from the use of discriminant analyses for data
from intercropping investigations is the usefulness and interpretation of a
canonical variable of the form $M_m + bB_m$. It appears that this canonical
variable obtained from a bivariate analysis is useless for interpretation of
intercropping data on mixtures of two crops. The criterion used to obtain the
canonical variable, i.e., the linear combination producing maximum
discriminating ability among treatments, has no meaning. To illustrate, consider a
set of data obtained from an experiment designed as a randomized complete
block with four blocks and including the eight maize and bean mixtures
described in Table 1. A bivariate analysis of variance is given in Table 2. For
the discussion, ignore the fact that data were missing for beans in one of the
bean-maize mixtures. Carrying through the computations, it was found that
b in $M_m + bB_m$ was 38. Now a meaningful b in terms of prices of maize and
beans is between three and seven. A meaningful b in terms of land use is in
the same range. These values are not even close to the one obtained in the
bivariate analysis. To put this into perspective, Figure 1 has been prepared
where the canonical correlations are computed for various values of b. The
canonical correlation is computed by dividing the treatment sum of squares
for the canonical variable by the treatment plus error sums of squares. The
formula is

$$S = \frac{M_{tt} + 2bC_{tt} + b^2 B_{tt}}{M_{tt} + M_{ee} + 2b(C_{tt} + C_{ee}) + b^2(B_{tt} + B_{ee})},$$

where $M_{tt}$ and $M_{ee}$ are the treatment and error sums of squares,
respectively, for maize yields in the mixtures and similarly for $B_{tt}$ and $B_{ee}$ for
beans, and $C_{tt}$ and $C_{ee}$ are the cross products of maize and bean yields for
treatments and for error, respectively. The canonical correlation measures
discriminating ability of the canonical variable. S has been computed for
values of b in the range 0 < b < 100. The range of 3 < b < 7, or even
2 < b < 12, has a practical interpretation, whereas values of b in the range
of b > 12 have no meaning, throwing considerable doubt on the criterion of
"maximum discriminating ability". One other point of interest is the
relative flatness of the curve for S when b > 20, with a limiting value close in
value to the maximum at 38. Because of this problem, it would appear that
this type of multivariate analysis is inappropriate for analyzing data from
intercropping experiments. One should also note that "maximum
discriminating ability" is not invariant with respect to differences among maize or
among bean lines. The multivariate approach taken by Pearce and Gilliver
(1978, 1979) appears to be a fruitful direction for multivariate analyses in
the future.

[Table 2 is not recoverable; its surviving footnote reads: "1 less one for first variable bean yields".]

Figure 1. Treatment/(treatment + error) sums of squares for 0 < b < 100.
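
The behaviour of S as b varies can be explored with a few lines of code. The sketch below is illustrative only: the sums of squares and cross products are hypothetical values and are not the ones behind Table 2 or Figure 1, and the function name is an assumption.

```python
def discrimination_ratio(b, m_tt, m_ee, c_tt, c_ee, b_tt, b_ee):
    """Treatment / (treatment + error) sum of squares for the variable M + b*B."""
    treatment = m_tt + 2.0 * b * c_tt + b * b * b_tt
    error = m_ee + 2.0 * b * c_ee + b * b * b_ee
    return treatment / (treatment + error)


# Hypothetical sums of squares and cross products (illustration only).
sums = dict(m_tt=900.0, m_ee=600.0, c_tt=20.0, c_ee=-5.0, b_tt=4.0, b_ee=1.5)

grid = [i / 10.0 for i in range(0, 1001)]                 # b = 0.0, 0.1, ..., 100.0
curve = [discrimination_ratio(b, **sums) for b in grid]
best_b = grid[curve.index(max(curve))]                    # grid value maximizing S
limit = sums["b_tt"] / (sums["b_tt"] + sums["b_ee"])      # value approached as b grows
```

As b grows, S approaches the ratio for beans alone, $B_{tt}/(B_{tt} + B_{ee})$, which accounts for a flat curve at large b. A maximizing b found in this way carries no guarantee of lying in the practically meaningful range.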

4. MULTIVARIATE ANALYSES FOR p CROPS IN MIXTURES
OF k CROPS WITH ONE LINE PER CROP 

In the previous section, two crops were considered. There were $c_1$ lines
of crop one and $c_2$ lines of crop two. Here we consider the case where there
are mixtures of k crops from among p crops, k < p. Also, here we consider
that there is only one line per crop. In the next section the case of $c_i$ lines
on the ith crop is discussed. To illustrate, consider that four crops are to be
included in an investigation wherein mixtures of three of the four crops are
studied. All possible mixtures of three crops, for A being crop one, B crop
two, C crop three, and D crop four, are given in Table 3.

There also could be four sole crops, six mixtures of two crops, i.e.,
AB, AC, AD, BC, BD, and CD, and one mixture of four crops, ABCD.
For the above discussed set of mixtures of three crops, let us designate the
yield response on crop i as $X_i$, i = A, B, C, D. The measurements on the
four variables (crops) in one of the r complete blocks for mixtures of three
are given in Table 4.

Table 3. Mixtures of Three of Four Crops

            Mixture
    1      2      3      4

    A      A      A      B
    B      B      C      C
    C      D      D      D

Table 4. Individual Crop Responses
For Mixtures in Table 3

                   Variable (crop)
Mixture     X_A     X_B     X_C     X_D

   1         x       x       x
   2         x       x               x
   3         x               x       x
   4                 x       x       x
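
The pattern of available responses for mixtures of three out of four crops can be generated directly, which also shows how often each crop and each pair of crops is observed. A minimal sketch using the crop labels of the text; the variable names are illustrative.

```python
from itertools import combinations

crops = ["A", "B", "C", "D"]
mixtures = list(combinations(crops, 3))    # ABC, ABD, ACD, BCD

# Incidence matrix: True where the crop is observed in the mixture.
incidence = [[crop in mixture for crop in crops] for mixture in mixtures]

# Number of mixtures in which each crop appears.
replications = [sum(row[i] for row in incidence) for i in range(len(crops))]

# Number of mixtures in which each pair of crops appears together.
pair_counts = {
    (a, b): sum(a in m and b in m for m in mixtures)
    for a, b in combinations(crops, 2)
}
```

Each crop appears in three of the four mixtures and each pair of crops in two of them, so a sum of squares rests on more observations than a sum of cross products.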

In Table 4, "x" denotes that an observation for the variable is available
and a blank means that no value for the variable was obtained. The design
for the responses in this case would be a balanced incomplete block design
with design parameters $v^* = 4$, $b^* = 4$, $r^* = 3$, $k^* = 3$, and $\lambda^* = 2$. In
computing a sum of squares for a variable, $rr^*$ measurements would be
used. To compute a sum of cross products, $r\lambda^*$ observations would be used
for any pair of crops. Let $T_{ii}$ and $E_{ii}$ represent a treatment and error sum of
squares for variable i, respectively, and let $T_{ij}$ and $E_{ij}$, $i \ne j = 1, 2, \ldots, p$,
represent the cross products between crops i and j. Testing in a multivariate
analysis involves determinants of matrices composed of sums of squares and
cross products, of the form:

$$
\frac{\begin{vmatrix} E_{11} & \cdots & E_{1p} \\ \vdots & & \vdots \\ E_{1p} & \cdots & E_{pp} \end{vmatrix}}
{\begin{vmatrix} E_{11} + T_{11} & \cdots & E_{1p} + T_{1p} \\ \vdots & & \vdots \\ E_{1p} + T_{1p} & \cdots & E_{pp} + T_{pp} \end{vmatrix}}
\tag{5}
$$


Since the $E_{ii}$ are computed from $rr^*$ responses and the $E_{ij}$ from $r\lambda^*$
responses, one could use a conservative approach and multiply each $E_{ii}$ by
$\lambda^*/r^*$. This would put sums of squares and sums of cross products on the
same basis in order that a test could be used. Such a procedure would
not utilize all the information in the data but would allow use of standard
multivariate analyses. It should be noted that $rr^*$ and $r\lambda^*$ can be quite
different in value. For example, let $v^* = 25$, $k^* = 5$, $r^* = 6$, $b^* = 30$, and
$\lambda^* = 1$. For this case, $\lambda^*/r^* = 1/6$, or $rr^* = 6r\lambda^*$. Even for $v^* = 7 = b^*$,
$k^* = 3 = r^*$, and $\lambda^* = 1$, $rr^* = 3r\lambda^*$. For $v^* = 27$, $r^* = 13$, $k^* = 3$,
$b^* = 117$, and $\lambda^* = 1$, $rr^* = 13r\lambda^*$. The number of crops and the size of
the mixture $k^*$ of crops determines the disparity in $rr^*$ and $r\lambda^*$ values.
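
The design parameters quoted above can be checked against the two counting identities that every balanced incomplete block design must satisfy, and the disparity between the numbers of observations follows from them. The sketch below takes block size 3 for the four-crop design of Table 4. The function name is illustrative, and the identities are necessary conditions only.

```python
def bibd_check(v, b, r, k, lam):
    """True when the counting identities b*k = v*r and lam*(v - 1) = r*(k - 1) both hold."""
    return b * k == v * r and lam * (v - 1) == r * (k - 1)


designs = [
    dict(v=4, b=4, r=3, k=3, lam=2),       # mixtures of three of four crops
    dict(v=25, b=30, r=6, k=5, lam=1),
    dict(v=7, b=7, r=3, k=3, lam=1),
    dict(v=27, b=117, r=13, k=3, lam=1),
]

all_consistent = all(bibd_check(**d) for d in designs)

# Ratio of observations behind a sum of squares to those behind a cross product.
ratios = [d["r"] / d["lam"] for d in designs]
```

The ratios come out as 1.5, 6, 3 and 13, matching the multiples quoted in the text for the last three designs.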

At first glance, it might appear that the paper by Srivastava (1968)
would be useful in analyzing incomplete responses such as described above.
A study of the paper shows this not to be the case, and it is not known how to
adapt the results for situations encountered in intercropping investigations.
Srivastava (1968) considers the following situation. For t treatments in a
block design on which p responses are to be studied, some or all of the
responses are obtained for a subset of the blocks. To illustrate, let t = 4,
r = 3, k = 3, b = 4, and $\lambda = 2$ be the design on which a set of responses is
measured. Let the design on the responses also be a blocked design for the
p = 3 response variables. For this design let $S_1 = (X_1, X_2)$, $S_2 = (X_1, X_3)$,
$S_3 = (X_2, X_3)$, and $S_4 = (X_1, X_2, X_3)$, that is, two of the three responses
are obtained in three sets of b blocks and all three responses are obtained in
one set of b blocks. This would require a total of 4(4) = 16 blocks of size
k = 3 experimental units. If each $S_h$, h = 1, 2, 3, 4, is included twice as in
Srivastava's (1968) example, the plan in Table 5 would result. There would
be a total of 32 blocks of size k = 3 experimental units each for a total of 96
experimental units. The total number of observations would be 6(2)(12) +
2(3)(12) = 216.
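
The block, unit and observation counts in this example follow from simple bookkeeping, sketched below. The set labels follow Table 5 and the variable names are illustrative.

```python
blocks_per_set, block_size = 4, 3

# Number of response variables measured in each of the eight sets of Table 5.
variables_per_set = {"S1": 2, "S2": 2, "S3": 2, "S4": 2, "S5": 2, "S6": 2, "S7": 3, "S8": 3}

units_per_set = blocks_per_set * block_size                     # 12 experimental units
total_blocks = blocks_per_set * len(variables_per_set)
total_units = units_per_set * len(variables_per_set)
total_observations = sum(n * units_per_set for n in variables_per_set.values())
```

This reproduces the 32 blocks, 96 experimental units and 6(2)(12) + 2(3)(12) = 216 observations of the text.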

Purportedly the reason for not observing responses on all three variables
on every one of the 96 experimental units was cost or lack of time. It would
appear that a more reasonable and desirable procedure would be to use
one third fewer blocks and measure all variables on each experimental unit.
Then one could use standard multivariate procedures and not run into the
difficulties discussed in the Srivastava (1968) paper.

Table 5. Layout for Srivastava (1968) Example

                                    Block (treatments)
Set     Variables          1 (123)    2 (124)    3 (134)    4 (234)

S_1     V_1, V_2
S_2     V_1, V_2
S_3     V_1, V_3
S_4     V_1, V_3
S_5     V_2, V_3
S_6     V_2, V_3
S_7     V_1, V_2, V_3
S_8     V_1, V_2, V_3

Srivastava (1968) was interested in the treatments in the design. In
intercropping the sets $S_h$ are the treatments and are the items of interest in
the investigation. Thus interest was on the column totals of Table 5, whereas
the row totals of Table 5 represent the treatments in an intercropping
investigation. A further study of the procedures described threw no light on how
to adapt the procedures of the paper for the situation encountered.


5. MULTIVARIATE ANALYSES FOR p CROPS IN MIXTURES
OF k CROPS WITH $c_i$ LINES ON CROP i

Consider the situation wherein there are p crops which will be grown in
mixtures of k crops and that there are $c_i$ lines on crop i, $i = 1, 2, \ldots, p$,
k < p. For the $\binom{p}{k}$ combinations or some subset thereof, one could conduct a
k-variate analysis in the manner described in Section 3 for one pair of crops.
The extension would be straightforward. The difficulties discussed for the
bivariate case would carry over to the k-variate case.

If one wished to use a combined analysis for all $\binom{p}{k}$ combinations then all
the difficulties described in Section 4 arise when one attempts a discriminant
analysis. If the values $v_i$ in (1), $c_i$ in (2), or $C_{is}$ in (3) are given or if the
relative values $b_i$ in (4) are known, then there is no difficulty in forming such
a canonical variable or a relative canonical variable. The only difficulty that
would arise would be in obtaining a range of these values. For two crops, it is
easy to vary b in $X_1 + bX_2$. For three crops it is again not too difficult to vary
b and c in $X_1$, $X_1 + bX_2$, $X_1 + cX_3$, $bX_2 + cX_3$, $bX_2$, $cX_3$, and $X_1 + bX_2 + cX_3$
where there would be sole crops, mixtures of two crops, and a mixture of all
three crops to compare. One could look at the various levels of b, say $\ell_2$, at
various levels of c, say $\ell_3$. There would be $\ell_2 \ell_3$ levels to compare the various
treatments (mixtures) in the experiment. With p crops there would be
$\ell_2 \ell_3 \cdots \ell_p$ total levels at which to compare the treatments in the experiment,
where $\ell_i$ is the relative level (crop value, calories, and/or land use) for crop
i. This many computations makes presentation and interpretation difficult.
Hence, $\ell_i$ should be as small as possible. For Figure 1, we used $0 < \ell_2 < 100$,
but such a range of values would be extremely tedious if one used
$0 < \ell_2 < 100$, $0 < \ell_3 < 100$, $0 < \ell_4 < 100$, etc. Therefore, it is recommended that
at most three values for each $\ell_i$ be used, i.e., a low value, an average value,
and a high value. For calorie or protein conversions, it may not be necessary
to vary the $\ell_i$. For land use and for monetary value, these values should be
varied.
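
The recommendation of a low, an average and a high value for each relative coefficient leads to a small grid of combinations, which can be enumerated as follows. All numbers and names in this sketch are hypothetical and serve only to show the bookkeeping.

```python
from itertools import product

# Hypothetical low, average and high relative values b_i for crops 2, 3 and 4.
levels = {
    "crop 2": [3.0, 5.0, 7.0],
    "crop 3": [0.5, 1.0, 1.5],
    "crop 4": [2.0, 4.0, 6.0],
}


def combined_yield(yields, coefficients):
    """X_1 + b_2 X_2 + ... + b_p X_p for one treatment; crop 1 has coefficient one."""
    return yields[0] + sum(b * x for b, x in zip(coefficients, yields[1:]))


grid = list(product(*levels.values()))                 # every combination of levels
example_yields = [2500.0, 300.0, 800.0, 150.0]         # hypothetical yields, kg/ha
values = [combined_yield(example_yields, combo) for combo in grid]
```

With three values for each of three crops there are 27 combinations at which the treatments would have to be compared. With 100 values per crop there would be a million, which is the difficulty noted above.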

6. SOME UNRESOLVED PROBLEMS IN MULTIVARIATE ANALYSES

One problem raised in the preceding discussion was the comparison of
sole crops with mixtures of two, three, ..., and the comparison of mixtures of
two with mixtures of three, four, etc. One method of doing this might be
the following. Let $a_i$ be the proportion of the ith variate in a multivariate
distribution with $\sum_i a_i = 1$. When the $a_i = a$, we have present multivariate
theory. If one were able to generalize present multivariate theory, this would
appear to be one way of making the above comparisons, though it must be
realized that they may not be useful in a practical interpretation of data
from intercropping experiments. It would solve the problem of comparing
canonical variables with different numbers of variables.

The use of weight $a_i$ for variable $X_i$ in a multivariate normal
distribution could be useful in analyzing replacement series data such as depicted
in Figure 2. Here the proportion of one crop goes from zero to one while a
second crop goes from one to zero. If the effect of a mixture was simply
substitution of crops with no beneficial or no detrimental effect of the mixture,
then the yields of the mixture would follow the solid line in Figure 2. If one
developed the generalized multivariate normal as described above, it would
be for this situation.

[Figure 2 axes: yield of crop two against proportion of crop two (1.0, .75, .5, .25, 0).]

Figure 2. Replacement series for two crops.

On the other hand, suppose that the mixture yields followed the dashed 
line above the solid line. This would indicate a beneficial effect of mixing 
two crops, which is a usual result in intercropping if the crops are judiciously 
selected (Okigbo, 1981). If the curve were a smooth function, perhaps only 
one additional parameter in the generalized multivariate normal distribution 
would be required to handle the situation. A development of the theory for 
this situation would be useful. 

A second line of development required in multivariate analysis is the
one discussed in Sections 4 and 5. This has to do with the computation of
sums of squares and cross products from unequal numbers of observations.
A simple problem in this area is the distribution of the variance of the linear
regression coefficient computed as $\beta$ = covariance $(x, y)$/variance $(x)$, where
the covariance $(x, y)$ has $N_1$ degrees of freedom and variance $(x)$ has $N_2$
degrees of freedom. For the multivariate situation let us consider a specific
case for $p = 3$, for sole crop yields $X_{1s}$, $X_{2s}$, and $X_{3s}$, for mixtures of two
crops with yields $X_1$ and $X_2$, $X_1$ and $X_3$, and $X_2$ and $X_3$, and for a mixture of
all three crops with yields $X_1$, $X_2$, and $X_3$. In a completely randomized design
one may compute the following matrix of sums of squares and cross products:


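\[
\begin{bmatrix}
S_1 & 0 & 0 \\
0 & S_2 & 0 \\
0 & 0 & S_3
\end{bmatrix}
\]

where $0$ is a $3 \times 3$ matrix of zeros,

\[
S_1 = \operatorname{diag}[S_{11}(\text{sole}), S_{22}(\text{sole}), S_{33}(\text{sole})],
\]

\[
S_j =
\begin{bmatrix}
S_{11}(j) & S_{12}(j) & S_{13}(j) \\
S_{12}(j) & S_{22}(j) & S_{23}(j) \\
S_{13}(j) & S_{23}(j) & S_{33}(j)
\end{bmatrix}
\quad \text{for } j = 2, 3;
\]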


$S_{ii}(2)$ is a sum of squares computed from the yields of crop $i$ when it is in
mixtures of two and computed from $2r$ observations, $S_{ii'}(2)$, $i \neq i'$, is a sum
of cross products for crops $i$ and $i'$ computed from $r$ observations, $S_{ii}(3)$ is
a sum of squares for crop $i$ computed from mixtures of three and obtained
from $r$ observations, $S_{ii'}(3)$, $i \neq i'$, is a sum of cross products computed
from the $r$ observations, from mixtures of three, and $S_{ii}(\text{sole})$ is a sum of
squares computed from $r$ sole crop yields in the $r$ replicates.
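
The assembly of this nine by nine matrix can be sketched as follows. This is a minimal illustration, not the author's computational procedure: the data are randomly generated placeholders, the helper names are hypothetical, and the sums of squares are assumed to be corrected about the mean within each treatment and then pooled. Crops are indexed 0, 1, 2 in the code.

```python
import numpy as np


def corrected_ss(x):
    """Sum of squares of x about its own mean."""
    x = np.asarray(x, dtype=float)
    return float(np.sum((x - x.mean()) ** 2))


def corrected_cp(x, y):
    """Sum of cross products of x and y about their own means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((x - x.mean()) * (y - y.mean())))


def assemble(sole, pairs, triple):
    """Build the block-diagonal matrix with blocks S1 (sole), S2 (pairs), S3 (triple).

    sole:   dict crop -> r sole-crop yields
    pairs:  dict (i, j) -> r x 2 array of yields of crops i and j grown together
    triple: r x 3 array of yields from the three-crop mixture
    """
    s1 = np.diag([corrected_ss(sole[i]) for i in range(3)])
    s2 = np.zeros((3, 3))
    for (i, j), y in pairs.items():
        y = np.asarray(y, dtype=float)
        # Each crop occurs in two pairs, so its diagonal entry pools 2r observations.
        s2[i, i] += corrected_ss(y[:, 0])
        s2[j, j] += corrected_ss(y[:, 1])
        # Each pair of crops occurs together once, so cross products use r observations.
        s2[i, j] = s2[j, i] = corrected_cp(y[:, 0], y[:, 1])
    t = np.asarray(triple, dtype=float)
    d = t - t.mean(axis=0)
    s3 = d.T @ d
    z = np.zeros((3, 3))
    return np.block([[s1, z, z], [z, s2, z], [z, z, s3]])


# Placeholder data with r = 4 replicates (assumed values, for illustration only).
rng = np.random.default_rng(0)
r = 4
sole = {i: rng.normal(size=r) for i in range(3)}
pairs = {(0, 1): rng.normal(size=(r, 2)),
         (0, 2): rng.normal(size=(r, 2)),
         (1, 2): rng.normal(size=(r, 2))}
triple = rng.normal(size=(r, 3))
S = assemble(sole, pairs, triple)
```

The sketch makes the unequal numbers of observations visible: the diagonal of the second block pools two treatments per crop, while each off-diagonal entry comes from a single treatment.
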

There are only three variates but there is a nine by nine matrix with 
the population variances for the nine diagonals being different. Also, there 
are six different parameters for the covariances. The mean for crop $X_i$
changes from sole to mixtures of two to mixtures of three. For this situation, 
comparisons of sole crop yields with mixtures of two and of three crops and 
comparisons of mixtures of two crops with a mixture of three crops are 
desired. One has only a three variate problem which turns into a situation 
resembling a nine variate problem. 

In the above, if we considered only mixtures of two, the sums of squares
would be computed from $2r$ observations and the sums of cross products
from $r$ observations. Now what is the distribution of error over treatment
plus error sums of squares and cross products? One would be considering
a quotient of the form given by equation (5). In general, if there are
$\binom{v^*}{k^*}$ combinations, a balanced incomplete block design can be created
with parameters $v^*$, $k^*$, $b^* = v^*!/[k^*!(v^* - k^*)!]$,
$r^* = (v^* - 1)!/[(k^* - 1)!(v^* - k^*)!]$, and
$\lambda^* = (v^* - 2)!/[(k^* - 2)!(v^* - k^*)!]$. The sums of squares from crops in
mixtures of $k^*$ would be computed from $rr^*$ observations and the sums of cross
products from $r\lambda^*$ observations. How do the estimation and
testing proceed for this situation?
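
The design parameters above are binomial coefficients, and a short check can confirm them by direct enumeration. The sketch below is illustrative only; the function names are hypothetical, and it assumes $2 \le k^* \le v^*$.

```python
from itertools import combinations
from math import comb


def unreduced_bibd_parameters(v, k):
    """Return (b, r, lam) for the design made of all k-subsets of v crops."""
    b = comb(v, k)                # number of mixtures (blocks)
    r = comb(v - 1, k - 1)        # mixtures containing a given crop
    lam = comb(v - 2, k - 2)      # mixtures containing a given pair of crops
    return b, r, lam


def count_by_enumeration(v, k):
    """Obtain the same three counts by listing every mixture of k crops."""
    blocks = list(combinations(range(v), k))
    r = sum(1 for blk in blocks if 0 in blk)
    lam = sum(1 for blk in blocks if 0 in blk and 1 in blk)
    return len(blocks), r, lam


# Three crops in mixtures of two, as in the example in the text.
b_star, r_star, lambda_star = unreduced_bibd_parameters(3, 2)
```

For three crops in mixtures of two this gives $r^* = 2$ and $\lambda^* = 1$, so with $r$ replicates the sums of squares rest on $2r$ observations and the sums of cross products on $r$ observations, in agreement with the counts stated earlier.
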

A third problem that arises is the interpretation of results from an
intercropping experiment when equation (1), (2), (3), or (4) is used as the first
canonical variable. If one then computes a multivariate analysis and obtains
canonical variables after (4), say, the computational procedure needs to be
detailed, and it must be determined whether or not this procedure produces any
useful results (see, e.g., Burnaby, 1966; Hanley and Parnes, 1983).

In connection with the preceding, the criterion of maximum discriminating
ability arises. From the example given it would appear that a canonical
variable arising from the criterion of maximum discriminating ability has no
practical interpretation in intercropping investigations. The question arises
as to what other criteria should be used in analyzing data from intercropping
experiments. Should we continue to cling solely to the criterion set forth by
Fisher (1936, 1938, 1940) and Smith (1936) or should multivariate theory
proceed using other criteria?

Another question that arises is the following. Suppose one has $k$ canonical
variables to summarize the information from $p$ variables, $k > p$. Now
should one do a second-stage multivariate analysis and treat the $k$ canonical
variables as a $k$-variate problem? Can one now obtain a linear combination
of linear combinations and have a stage 2 canonical variable? Should
one proceed to an $s$-stage multivariate analysis? This problem arises in
intercropping. Should one treat equation (1) as variate one, equation (2) as
variate two, and equation (3) as variate three and conduct a three-variate
multivariate analysis?